Profile Hidden Markov Models#
- class pyhmmer.plan7.HMM#
A data structure storing the Plan7 Hidden Markov Model.
Changed in version 0.4.6: Added the
evalue_parametersandcutoffsattributes.- sample(alphabet, M, randomness=None, ungapped=False, enumerable=False)#
Sample an HMM of length
Mat random.- Parameters:
alphabet (
Alphabet) – The alphabet of the model.M (
int) – The length of the model to generate (i.e. the number of nodes).randomness (
Randomness,intorNone) – The random number generator to use for sampling, or a seed to initialize a generator. IfNoneor0given, create a new random number generator with a random seed.ungapped (
bool) – Set toTrueto build an ungapped HMM, i.e. an HMM where the \(M_n o M_{n+1}\) are all one and the remaining transitions are zero. Ignored whenenumerableisTrue.enumerable (
bool) – Set toTrueto build a random HMM with no nonzero insertion transitions.
- Returns:
HMM– A new HMM generated at random.
Example
Generate a random, ungapped HMM for the DNA alphabet, using a random number generator initialized with a fixed seed, with 100 nodes:
>>> dna = easel.Alphabet.dna() >>> rng = easel.Randomness(42) >>> hmm = HMM.sample(dna, M=100, randomness=rng, ungapped=True)
Hint
This constructor is only useful for testing and should not be used in production code.
Added in version 0.7.0.
- emit_sequence(randomness=None)#
Emit a sequence from a core HMM.
- Parameters:
randomness (
Randomness,intorNone) – The random number generator to use for sampling, or a seed to initialize a generator. IfNoneor0given, create a new random number generator with a random seed.- Returns:
DigitalSequence– A sequence in digital mode sampled from the HMM. Only thesequencefield will be initialized.
Example
Generate a random sequence with a fixed seed:
>>> seq = thioesterase.emit_sequence(randomness=42) >>> seq.alphabet.decode(seq.sequence) 'RALFFLHPGTGAVGCYSQLADE...'
Generate a sequence from a given random number generator:
>>> rng = easel.Randomness(123) >>> seq1 = thioesterase.emit_sequence(rng) >>> seq2 = thioesterase.emit_sequence(rng) >>> seq1.sequence != seq2.sequence True
Note
By default, the
hmmemitexecutable uses the linear congruential generator, corresponding to aRandomnesscreated withfast=True. If either a seed orNoneis passed as therandomnessargument, the newRandomnesswill be created withfast=Trueas well.See also
Profile.emit_sequence, which allows generating random sequences from an implicit search profile rather than from the core model.HMM.emit_alignment, which generates an alignment of random sequences emitted from the core model rather than single individual sequences.
Added in version 0.12.1.
- emit_alignment(N, randomness=None)#
Emit a sequence from a core HMM.
- Parameters:
N (
int) – The number of sequences (i.e. rows) to generate for the alignment.randomness (
Randomness,intorNone) – The random number generator to use for sampling, or a seed to initialize a generator. IfNoneor0given, create a new random number generator with a random seed.
- Returns:
DigitalMSA– A MSA in digital mode composed ofNsequences sampled from the HMM.
Example
Generate a random sequence with a fixed seed:
>>> seq = thioesterase.emit_sequence(randomness=42) >>> seq.alphabet.decode(seq.sequence) 'RALFFLHPGTGAVGCYSQLADE...'
Generate a sequence from a given random number generator:
>>> rng = easel.Randomness(123) >>> seq1 = thioesterase.emit_sequence(rng) >>> seq2 = thioesterase.emit_sequence(rng) >>> seq1.sequence != seq2.sequence True
Note
By default, the
hmmemitexecutable uses the linear congruential generator, corresponding to aRandomnesscreated withfast=True. If either a seed orNoneis passed as therandomnessargument, the newRandomnesswill be created withfast=Trueas well.See also
Profile.emit_sequence, which allows generating random sequences from an implicit search profile rather than from the core model.HMM.emit_sequence, which allows generating random sequences from the core model of the HMM rather than from the a search profile.
Added in version 0.12.1.
- __init__(alphabet, M, name)#
Create a new HMM from scratch.
- copy()#
Return a copy of the HMM with the exact same configuration.
Added in version 0.3.0.
- match_occupancy()#
Calculate the match occupancy for each match state.
- Returns:
VectorF– A vector of size \(M+1\) containing the probability that each match state is used in a sample glocal path through the model.
Added in version 0.4.10.
- mean_match_entropy()#
Compute the mean entropy per HMM match state.
The mean entropy per match state emission distribution is defined as:
\[- \frac{1}{M} \sum_{k=1}^{M} \sum_x p_k(x) \log_2 p_k(x)\]where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\).
- Returns:
float– The mean entropy, in bits.
Example
>>> thioesterase.mean_match_entropy() 3.0425...
Added in version 0.4.10.
- mean_match_information(background)#
Compute the mean information content of the HMM match states.
The mean information content per match state emission distribution is defined as:
\[\frac{1}{M} \sum_{k=1}^{M} \left[ \sum_x f(x) \log_2 f(x) - \sum_x p_k(x) \log_2 p_k(x) \right]\]in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.
- Parameters:
background (
Background) – The null model from which to get the background emission probabilities.- Returns:
float– The mean information content, in bits.
Example
>>> background = plan7.Background(easel.Alphabet.amino()) >>> thioesterase.mean_match_information(background) 1.1330...
Added in version 0.4.10.
- mean_match_relative_entropy(background)#
Compute the mean relative entropy per HMM match state.
The mean relative entropy per match state emission distribution is defined as:
\[\frac{1}{M} \sum_{k=1}^{M} \sum_x { p_k(x) \log_2 \frac{p_k(x)}{f(x)} }\]in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.
- Parameters:
background (
Background) – The null model from which to get the background emission probabilities.- Returns:
float– The mean relative entropy, in bits.
Example
>>> background = plan7.Background(easel.Alphabet.amino()) >>> thioesterase.mean_match_relative_entropy(background) 1.1201...
Added in version 0.4.10.
- renormalize()#
Renormalize all parameter vectors (emissions and transitions).
Added in version 0.4.0.
- scale(scale, exponential=False)#
Rescale counts by a factor in a model containing counts.
This method only affects core probability model emissions and transitions.
- Parameters:
scale (
float) – The scaling factor to use (\(1.0\) for no scaling). Often computed using the ratio of effective sequences (\(\frac{n_{eff}}{n_{seq}}\))exponential (
bool) – When set toTrue, usescaleas an exponential factor (\(C_i = C_i^s\)) instead of a multiplicative factor (\(C_i = C_i \times s\)), resulting in a non-uniform scaling across columns. This can be useful when some heavily fragmented sequences are used to reconstruct a family MSA.
Added in version 0.4.0.
- set_composition()#
Calculate and set the model composition.
Added in version 0.4.0.
- set_consensus(sequence=None)#
Set the model consensus sequence.
If given a sequence, the HMM is assumed to originate from a single-sequence model, and the sequence is used as the consensus. Otherwise, the consensus is computed by extracting the consensus at each position.
Residues in the consensus sequence are uppercased when their emission probabilities are above an arbitrary, alphabet-specific threshold (\(0.9\) for nucleotide alphabets, \(0.5\) for protein).
- Parameters:
sequence (
DigitalSequence) – A sequence in digital mode with \(M\) residues and the same alphabet as the HMM.- Raises:
ValueError – When given a sequence with the wrong length.
AlphabetMismatch – When given a sequence in the wrong alphabet.
Added in version 0.10.1.
- to_profile(background=None, L=400, multihit=True, local=True)#
Create a new profile configured for this HMM.
This method is a shortcut for creating a new
Profileand callingconfigurefor a given HMM. Prefer manually callingconfigureto recycle the profile buffer when running inside a loop.- Parameters:
background (
Background, optional) – The null background model. InNonegiven, create a default one for the HMM alphabet.L (
int) – The expected target sequence length.multihit (
bool) – Whether or not to use multihit mode.local (
bool) – Whether to use local or global mode.
Added in version 0.8.0.
- validate(tolerance=0.0001)#
Validate the HMM agains the Plan7 structural constraints.
HMMs created with PyHMMER are always valid, whether they come from a
Builderor from theHMMconstructor. However, it is possible to manually edit the emission scores and transition probabilities, and the structural constrains may not hold. Call this method to make sure the HMM is valid after you are done editing it.- Parameters:
tolerance (
float) – The tolerance with which to check that the transition probabilities sum to 1.- Raises:
ValueError – When the HMM fails validation.
Added in version 0.8.1.
- write(fh, binary=False)#
Write the HMM to a file handle.
- Parameters:
- zero()#
Set all parameters to zero, including model composition.
- checksum#
The 32-bit checksum of the HMM, if any.
The checksum if calculated from the alignment the HMM was created from, and was introduced in more recent HMM formats. This means some
HMMobjects may have a non-Nonechecksum.Added in version 0.2.1.
Changed in version 0.3.1: Returns
Noneif the HMM flag for the checksum is not set.
- command_line#
The command line that built the model.
For HMMs created with
Builder, this defaults tosys.argv. It can however be set to any string, including multiline to show successive commands.Example
>>> print(thioesterase.command_line) hmmbuild Thioesterase.hmm Thioesterase.fa hmmcalibrate Thioesterase.hmm
Added in version 0.3.1.
- composition#
The model composition.
May not be available for freshly-created HMMs. To get the mean residue composition emitted by the model, the
set_compositionmethod must be called to compute the composition from occupancy.Note
Although the allocated buffer in the
P7_HMMobject is always the same dimension so that it can store the largest possible alphabet, we only expose the relevant residues. This means that the vector will be of sizealphabet.K:>>> dna = easel.Alphabet.dna() >>> dna.K 4 >>> hmm = plan7.HMM(dna, 100, "test") >>> hmm.set_composition() >>> len(hmm.composition) 4
Added in version 0.4.0.
- consensus_accessibility#
The consensus accessibility of the HMM, if any.
Added in version 0.3.1.
- consensus_structure#
The consensus structure of the HMM, if any.
Added in version 0.3.1.
- creation_time#
The creation time of the HMM, if any.
Example
Get the creation time for any HMM:
>>> thioesterase.creation_time datetime.datetime(2008, 11, 25, 17, 28, 32)
Set the creation time manually to a different date and time:
>>> ctime = datetime.datetime(2021, 8, 23, 23, 59, 19) >>> thioesterase.creation_time = ctime >>> thioesterase.creation_time datetime.datetime(2021, 8, 23, 23, 59, 19)
Danger
Internally,
libhmmeralways usesasctimeto generate a timestamp for the HMMs, so this property assumes that every creation time field can be parsed into adatetime.datetimeobject using the"%a %b %d %H:%M:%S %Y"format.Added in version 0.4.6.
- Type:
- cutoffs#
The bitscore cutoffs for this HMM.
- Type:
Cutoffs
- evalue_parameters#
The e-value parameters for this HMM.
- Type:
- insert_emissions#
The insert emissions of the model.
The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.
Caution
If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1. Consider calling
HMM.validateafter manual edition.Added in version 0.3.1.
Changed in version 0.4.0: This property is now a
MatrixF, and stores row 0.- Type:
- match_emissions#
The match emissions of the model.
The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.
Note
Since the first row corresponds to the entry probabilities, the emissions are unused. By convention, it should still contain valid probabilities, so it will always be set as follow with 1 probability for the first symbol, and 0 for the rest:
>>> hmm = HMM(easel.Alphabet.dna(), 100, "test") >>> hmm.match_emissions MatrixF([[1.0, 0.0, 0.0, 0.0], ...])
Caution
If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1. Consider calling
HMM.validateafter manual edition.Added in version 0.3.1.
Changed in version 0.4.0: This property is now a
MatrixF, and stores row 0.- Type:
- model_mask#
The model mask line from the alignment, if any.
Added in version 0.3.1.
- nseq#
The number of training sequences used, if any.
If the HMM was created from a multiple sequence alignment, this corresponds to the number of sequences in the MSA.
Example
>>> thioesterase.nseq 278
Added in version 0.3.1.
- nseq_effective#
The number of effective sequences used, if any.
Added in version 0.3.1.
- reference#
The reference line from the alignment, if any.
This is relevant if the HMM was built from a multiple sequence alignment (e.g. by
Builder.build_msa, or by an externalhmmbuildpipeline run).Added in version 0.3.1.
- transition_probabilities#
The transition probabilities of the model.
The property exposes a matrix of shape \((M+1, 7)\), with one row per node (plus one extra row for the entry probabilities), and one column per transition.
Columns correspond to the following transitions, in order: \(M_n \to M_{n+1}\), \(M_n \to I_n\), \(M_n \to D_{n+1}\), \(I_n \to I_n\), \(I_n \to M_{n+1}\), \(D_n \to D_{n+1}\), \(D_n \to M_{n+1}\). Use the
pyhmmer.plan7.Transitionsenum instead of hardcoded indices to make your code more legible.Example
>>> T = plan7.Transitions >>> p = thioesterase.transition_probabilities >>> p[1, T.MM] 0.999... >>> p[0, T.DM] # 1 by convention for the first node 1.0 >>> p[-1, T.MD] # 0 by convention for the last node 0.0
Caution
If editing this matrix manually, note that some invariants need to hold for the HMM to be valid: \(I_n\), \(M_n\) and \(D_n\) transition probabilities should only contain probabilities between 0 and 1, and sum to 1:
>>> T = plan7.Transitions >>> p = thioesterase.transition_probabilities >>> p[50, T.MM] + p[50, T.MI] + p[50, T.MD] 1.000... >>> p[50, T.IM] + p[50, T.II] 1.000... >>> p[50, T.DM] + p[50, T.DD] 1.000...
Consider calling
HMM.validateafter manual edition.Added in version 0.3.1.
Changed in version 0.4.0: This property is now a
MatrixF.- Type:
- class pyhmmer.plan7.Profile#
A Plan7 search profile.
Changed in version 0.4.6: Added the
evalue_parametersandcutoffsattributes.- HMM.emit_sequence(randomness=None)#
Emit a sequence from a core HMM.
- Parameters:
randomness (
Randomness,intorNone) – The random number generator to use for sampling, or a seed to initialize a generator. IfNoneor0given, create a new random number generator with a random seed.- Returns:
DigitalSequence– A sequence in digital mode sampled from the HMM. Only thesequencefield will be initialized.
Example
Generate a random sequence with a fixed seed:
>>> seq = thioesterase.emit_sequence(randomness=42) >>> seq.alphabet.decode(seq.sequence) 'RALFFLHPGTGAVGCYSQLADE...'
Generate a sequence from a given random number generator:
>>> rng = easel.Randomness(123) >>> seq1 = thioesterase.emit_sequence(rng) >>> seq2 = thioesterase.emit_sequence(rng) >>> seq1.sequence != seq2.sequence True
Note
By default, the
hmmemitexecutable uses the linear congruential generator, corresponding to aRandomnesscreated withfast=True. If either a seed orNoneis passed as therandomnessargument, the newRandomnesswill be created withfast=Trueas well.See also
Profile.emit_sequence, which allows generating random sequences from an implicit search profile rather than from the core model.HMM.emit_alignment, which generates an alignment of random sequences emitted from the core model rather than single individual sequences.
Added in version 0.12.1.
- __init__(M, alphabet)#
Create a new profile for the given
alphabet.
- clear()#
Clear internal buffers to reuse the profile without reallocation.
- configure(hmm, background, L=400, multihit=True, local=True)#
Configure a search profile using the given models.
- Parameters:
hmm (
HMM) – The model HMM with core probabilities.background (
Background) – The null background model.L (
int) – The expected target sequence length.multihit (
bool) – Whether or not to use multihit modes.local (
bool) – Whether or not to use non-local modes.
- copy()#
Return a copy of the profile with the exact same configuration.
- msv_filter(seq, nu=2.0)#
Compute the generic MSV filter score for the given sequence.
- Parameters:
seq (
DigitalSequence) – The sequence in digital format for which to compute the SSV filter score.nu (
float) – The expected number of hits.
- Returns:
Note
- Raises:
AlphabetMismatch – When the alphabet of the sequence does not correspond to the profile alphabet.
Added in version 0.12.0.
- to_optimized()#
Convert the profile to a platform-specific optimized profile.
- Returns:
OptimizedProfile– The platform-specific optimized profile built using the configuration of this profile.
- consensus#
The consensus residue line of the profile, if set.
Added in version 0.4.1.
- consensus_structure#
The consensus structure of the profile, if any.
Added in version 0.4.1.
- cutoffs#
The bitscore cutoffs for this profile.
- Type:
Cutoffs
- emission_scores#
The emission lod scores of the model.
Note that this is a hand-indexed 3-dimensional tensor, with shape \((K_p, M_{alloc}, 2)\). The last two dimensions are flattened.
Added in version 0.12.0.
- Type:
- evalue_parameters#
The e-value parameters for this profile.
- Type:
EvalueParameters
- class pyhmmer.plan7.OptimizedProfile#
An optimized profile that uses platform-specific instructions.
Optimized profiles store the match emissions and transition probabilities for a profile HMM so that they can be loaded in the SIMD code. Typically, matrices use aligned storage so that they can loaded efficiently, and are striped à la Farrar to compute pairwise scores for each sequence residue and profile node.
References
Farrar, Michael. Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
- __init__(M, alphabet)#
Create a new optimized profile from scratch.
Once allocated, you must call the
convertmethod with aProfileobject. It’s actually easier to useProfile.to_optimizedmethod to obtained a configuredOptimizedProfiledirectly, unless you’re explicitly trying to recycle memory.
- convert(profile)#
Store the given profile into
selfas a platform-specific profile.Use this method to configure an optimized profile from a
Profilewhile recycling the internal vector buffers.- Raises:
ValueError – When the optimized profile is too small to hold the profile, or when the standard and the optimized profiles are not compatible.
See also
The
Profile.to_optimizedmethod, which allows getting anOptimizedProfiledirectly from a profile without having to allocate first.
- copy()#
Create an exact copy of the optimized profile.
Note
The
Alphabetreferenced to by this object is not copied, usecopy.deepcopyif this is the intended behaviour.
- msv_filter(seq)#
Compute the MSV filter score for the given sequence.
- Parameters:
seq (
DigitalSequence) – The sequence in digital format for which to compute the MSV filter score.- Returns:
float– The MSV filter score, in nats, for the input sequence.
Note
math.infmay be returned if an overflow occurs.- Raises:
AlphabetMismatch – When the alphabet of the sequence does not correspond to the profile alphabet.
Added in version 0.12.0.
- ssv_filter(seq)#
Compute the SSV filter score for the given sequence.
- Parameters:
seq (
DigitalSequence) – The sequence in digital format for which to compute the SSV filter score.- Returns:
Note
- Raises:
AlphabetMismatch – When the alphabet of the sequence does not correspond to the profile alphabet.
Caution
This method is not available on the PowerPC platform (calling it will raise a
NotImplementedError).Added in version 0.4.0.
- write(fh_filter, fh_profile)#
Write an optimized profile to two separate files.
HMMER implements an acceleration pipeline using several scoring algorithms. Parameters for MSV (the Multi ungapped Segment Viterbi) are saved independently to the
fh_filterhandle, while the rest of the profile is saved tofh_profile.
- bias_b#
The positive bias to emission scores.
Added in version 0.4.0.
Changed in version 0.10.3: Renamed from
bias.- Type:
- consensus#
The consensus residue line of the profile, if any.
Added in version 0.4.11.
- consensus_structure#
The consensus structure of the profile, if any.
Added in version 0.4.11.
- cutoffs#
The bitscore cutoffs for this profile.
- Type:
Cutoffs
- ddbound_w#
The threshold precalculated for lazy \(DD\) evaluation.
Added in version 0.10.3.
- Type:
- evalue_parameters#
The e-value parameters for this profile.
- Type:
EvalueParameters
- model_mask#
The model mask line from the alignment, if any.
Added in version 0.3.1.
- ncj_roundoff#
The missing precision on \(NN,CC,JJ\) after rounding.
Added in version 0.10.3.
- Type:
- offsets#
The disk offsets for this optimized profile.
- Type:
Offsets
- reference#
The reference line from the alignment, if any.
This is relevant if the profile was built from a multiple sequence alignment (e.g. by
Builder.build_msa, or by an externalhmmbuildpipeline run).Added in version 0.3.1.
- class pyhmmer.plan7.OptimizedProfileBlock#
A container for storing
OptimizedProfileobjects.This collections allows the scan loop of
Pipelineobjects to scan several target profiles without having to acquire the GIL for each new model. It does so by synchronizing a PythonliststoringOptimizedProfileobjects with a C-contiguous array of pointers to the underlying struct of each object.Added in version 0.7.0.
- __init__(alphabet, iterable=())#
Create a new block from an iterable of
OptimizedProfileobjects.
- append(optimized_profile)#
Append an optimized profile at the end of the block.
- clear()#
Remove all optimized profiles from the block.
- copy()#
Return a copy of the optimized profile block.
Note
The optimized profiles internally refered to by this collection are not copied. Use
copy.deepcopyif you also want to duplicate the internal storage of each optimized profile.
- extend(iterable)#
Extend block by appending optimized profiles from the iterable.
- index(optimized_profile, start=0, stop=9223372036854775807)#
Return the index of the first occurence of
optimized_profile.- Raises:
ValueError – When the block does not contain
optimized_profile.
- insert(index, optimized_profile)#
Insert a new optimized profile in the block before
index.
- pop(index=-1)#
Remove and return an optimized profile from the block.
- remove(optimized_profile)#
Remove the first occurence of the given optimized profile.