Profile Hidden Markov Models#
- class pyhmmer.plan7.HMM#
A data structure storing the Plan7 Hidden Markov Model.
Changed in version 0.4.6: Added the
evalue_parameters
andcutoffs
attributes.- __copy__()#
None
- __deepcopy__(memo)#
None
- __getstate__()#
None
- __init__(alphabet, M, name)#
Create a new HMM from scratch.
- __reduce__()#
None
- __setstate__(state)#
None
- __sizeof__()#
None
- copy()#
Return a copy of the HMM with the exact same configuration.
Added in version 0.3.0.
- match_occupancy()#
Calculate the match occupancy for each match state.
- Returns:
VectorF
– A vector of size \(M+1\) containing the probability that each match state is used in a sample glocal path through the model.
Added in version 0.4.10.
- mean_match_entropy()#
Compute the mean entropy per HMM match state.
The mean entropy per match state emission distribution is defined as:
\[- \frac{1}{M} \sum_{k=1}^{M} \sum_x p_k(x) \log_2 p_k(x)\]where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\).
- Returns:
float
– The mean entropy, in bits.
Example
>>> thioesterase.mean_match_entropy() 3.0425...
Added in version 0.4.10.
- mean_match_information(background)#
Compute the mean information content of the HMM match states.
The mean information content per match state emission distribution is defined as:
\[\frac{1}{M} \sum_{k=1}^{M} \left[ \sum_x f(x) \log_2 f(x) - \sum_x p_k(x) \log_2 p_k(x) \right]\]in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.
- Parameters:
background (
Background
) – The null model from which to get the background emission probabilities.- Returns:
float
– The mean information content, in bits.
Example
>>> background = plan7.Background(easel.Alphabet.amino()) >>> thioesterase.mean_match_information(background) 1.1330...
Added in version 0.4.10.
- mean_match_relative_entropy(background)#
Compute the mean relative entropy per HMM match state.
The mean relative entropy per match state emission distribution is defined as:
\[\frac{1}{M} \sum_{k=1}^{M} \sum_x { p_k(x) \log_2 \frac{p_k(x)}{f(x)} }\]in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.
- Parameters:
background (
Background
) – The null model from which to get the background emission probabilities.- Returns:
float
– The mean relative entropy, in bits.
Example
>>> background = plan7.Background(easel.Alphabet.amino()) >>> thioesterase.mean_match_relative_entropy(background) 1.1201...
Added in version 0.4.10.
- renormalize()#
Renormalize all parameter vectors (emissions and transitions).
Added in version 0.4.0.
- classmethod sample(alphabet, M, randomness, ungapped=False, enumerable=False)#
Sample an HMM of length
M
at random.- Parameters:
alphabet (
Alphabet
) – The alphabet of the model.M (
int
) – The length of the model to generate (i.e. the number of nodes).randomness (
Randomness
) – The random number generator to use for sampling.ungapped (
bool
) – Set toTrue
to build an ungapped HMM, i.e. an HMM where the \(M_n o M_{n+1}\) are all one and the remaining transitions are zero. Ignored whenenumerable
isTrue
.enumerable (
bool
) – Set toTrue
to build a random HMM with no nonzero insertion transitions.
- Returns:
HMM
– A new HMM generated at random.
Added in version 0.7.0.
- scale(scale, exponential=False)#
Rescale counts by a factor in a model containing counts.
This method only affects core probability model emissions and transitions.
- Parameters:
scale (
float
) – The scaling factor to use (\(1.0\) for no scaling). Often computed using the ratio of effective sequences (\(\frac{n_{eff}}{n_{seq}}\))exponential (
bool
) – When set toTrue
, usescale
as an exponential factor (\(C_i = C_i^s\)) instead of a multiplicative factor (\(C_i = C_i \times s\)), resulting in a non-uniform scaling across columns. This can be useful when some heavily fragmented sequences are used to reconstruct a family MSA.
Added in version 0.4.0.
- set_composition()#
Calculate and set the model composition.
Added in version 0.4.0.
- set_consensus(sequence=None)#
Set the model consensus sequence.
If given a sequence, the HMM is assumed to originate from a single-sequence model, and the sequence is used as the consensus. Otherwise, the consensus is computed by extracting the consensus at each position.
Residues in the consensus sequence are uppercased when their emission probabilities are above an arbitrary, alphabet-specific threshold (\(0.9\) for nucleotide alphabets, \(0.5\) for protein).
- Parameters:
sequence (
DigitalSequence
) – A sequence in digital mode with \(M\) residues and the same alphabet as the HMM.- Raises:
ValueError – When given a sequence with the wrong length.
AlphabetMismatch – When given a sequence in the wrong alphabet.
Added in version 0.10.1.
- to_profile(background=None, L=400, multihit=True, local=True)#
Create a new profile configured for this HMM.
This method is a shortcut for creating a new
Profile
and callingconfigure
for a given HMM. Prefer manually callingconfigure
to recycle the profile buffer when running inside a loop.- Parameters:
background (
Background
, optional) – The null background model. InNone
given, create a default one for the HMM alphabet.L (
int
) – The expected target sequence length.multihit (
bool
) – Whether or not to use multihit mode.local (
bool
) – Whether to use local or global mode.
Added in version 0.8.0.
- validate(tolerance=0.0001)#
Validate the HMM agains the Plan7 structural constraints.
HMMs created with PyHMMER are always valid, whether they come from a
Builder
or from theHMM
constructor. However, it is possible to manually edit the emission scores and transition probabilities, and the structural constrains may not hold. Call this method to make sure the HMM is valid after you are done editing it.- Parameters:
tolerance (
float
) – The tolerance with which to check that the transition probabilities sum to 1.- Raises:
ValueError – When the HMM fails validation.
Added in version 0.8.1.
- write(fh, binary=False)#
Write the HMM to a file handle.
- Parameters:
- zero()#
Set all parameters to zero, including model composition.
- checksum#
The 32-bit checksum of the HMM, if any.
The checksum if calculated from the alignment the HMM was created from, and was introduced in more recent HMM formats. This means some
HMM
objects may have a non-None
checksum.Added in version 0.2.1.
Changed in version 0.3.1: Returns
None
if the HMM flag for the checksum is not set.
- command_line#
The command line that built the model.
For HMMs created with
Builder
, this defaults tosys.argv
. It can however be set to any string, including multiline to show successive commands.Example
>>> print(thioesterase.command_line) hmmbuild Thioesterase.hmm Thioesterase.fa hmmcalibrate Thioesterase.hmm
Added in version 0.3.1.
- composition#
The model composition.
May not be available for freshly-created HMMs. To get the mean residue composition emitted by the model, the
set_composition
method must be called to compute the composition from occupancy.Note
Although the allocated buffer in the
P7_HMM
object is always the same dimension so that it can store the largest possible alphabet, we only expose the relevant residues. This means that the vector will be of sizealphabet.K
:>>> dna = easel.Alphabet.dna() >>> dna.K 4 >>> hmm = plan7.HMM(dna, 100, b"test") >>> hmm.set_composition() >>> len(hmm.composition) 4
Added in version 0.4.0.
- consensus_accessibility#
The consensus accessibility of the HMM, if any.
Added in version 0.3.1.
- consensus_structure#
The consensus structure of the HMM, if any.
Added in version 0.3.1.
- creation_time#
The creation time of the HMM, if any.
Example
Get the creation time for any HMM:
>>> thioesterase.creation_time datetime.datetime(2008, 11, 25, 17, 28, 32)
Set the creation time manually to a different date and time:
>>> ctime = datetime.datetime(2021, 8, 23, 23, 59, 19) >>> thioesterase.creation_time = ctime >>> thioesterase.creation_time datetime.datetime(2021, 8, 23, 23, 59, 19)
Danger
Internally,
libhmmer
always usesasctime
to generate a timestamp for the HMMs, so this property assumes that every creation time field can be parsed into adatetime.datetime
object using the"%a %b %d %H:%M:%S %Y"
format.Added in version 0.4.6.
- Type:
- cutoffs#
The bitscore cutoffs for this HMM.
- Type:
Cutoffs
- evalue_parameters#
The e-value parameters for this HMM.
- Type:
- insert_emissions#
The insert emissions of the model.
The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.
Caution
If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1. Consider calling
HMM.validate
after manual edition.Added in version 0.3.1.
Changed in version 0.4.0: This property is now a
MatrixF
, and stores row 0.- Type:
- match_emissions#
The match emissions of the model.
The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.
Note
Since the first row corresponds to the entry probabilities, the emissions are unused. By convention, it should still contain valid probabilities, so it will always be set as follow with 1 probability for the first symbol, and 0 for the rest:
>>> hmm = HMM(easel.Alphabet.dna(), 100, b"test") >>> hmm.match_emissions MatrixF([[1.0, 0.0, 0.0, 0.0], ...])
Caution
If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1. Consider calling
HMM.validate
after manual edition.Added in version 0.3.1.
Changed in version 0.4.0: This property is now a
MatrixF
, and stores row 0.- Type:
- model_mask#
The model mask line from the alignment, if any.
Added in version 0.3.1.
- nseq#
The number of training sequences used, if any.
If the HMM was created from a multiple sequence alignment, this corresponds to the number of sequences in the MSA.
Example
>>> thioesterase.nseq 278
Added in version 0.3.1.
- nseq_effective#
The number of effective sequences used, if any.
Added in version 0.3.1.
- reference#
The reference line from the alignment, if any.
This is relevant if the HMM was built from a multiple sequence alignment (e.g. by
Builder.build_msa
, or by an externalhmmbuild
pipeline run).Added in version 0.3.1.
- transition_probabilities#
The transition probabilities of the model.
The property exposes a matrix of shape \((M+1, 7)\), with one row per node (plus one extra row for the entry probabilities), and one column per transition.
Columns correspond to the following transitions, in order: \(M_n \to M_{n+1}\), \(M_n \to I_n\), \(M_n \to D_{n+1}\), \(I_n \to I_n\), \(I_n \to M_{n+1}\), \(D_n \to D_{n+1}\), \(D_n \to M_{n+1}\). Use the
pyhmmer.plan7.Transitions
enum instead of hardcoded indices to make your code more legible.Example
>>> t = thioesterase.transition_probabilities >>> t[1, Transitions.MM] 0.999... >>> t[0, Transitions.DM] # 1 by convention for the first node 1.0 >>> t[-1, Transitions.MD] # 0 by convention for the last node 0.0
Caution
If editing this matrix manually, note that some invariants need to hold for the HMM to be valid: \(I_n\), \(M_n\) and \(D_n\) transition probabilities should only contain probabilities between 0 and 1, and sum to 1:
>>> t = thioesterase.transition_probabilities >>> t[50, Transitions.MM] + t[50, Transitions.MI] + t[50, Transitions.MD] 1.000... >>> t[50, Transitions.IM] + t[50, Transitions.II] 1.000... >>> t[50, Transitions.DM] + t[50, Transitions.DD] 1.000...
Consider calling
HMM.validate
after manual edition.Added in version 0.3.1.
Changed in version 0.4.0: This property is now a
MatrixF
.- Type:
- class pyhmmer.plan7.Profile#
A Plan7 search profile.
Changed in version 0.4.6: Added the
evalue_parameters
andcutoffs
attributes.- __copy__()#
None
- __deepcopy__(memo)#
None
- __init__(M, alphabet)#
Create a new profile for the given
alphabet
.
- __sizeof__()#
None
- clear()#
Clear internal buffers to reuse the profile without reallocation.
- configure(hmm, background, L=400, multihit=True, local=True)#
Configure a search profile using the given models.
- Parameters:
hmm (
HMM
) – The model HMM with core probabilities.background (
Background
) – The null background model.L (
int
) – The expected target sequence length.multihit (
bool
) – Whether or not to use multihit modes.local (
bool
) – Whether or not to use non-local modes.
- copy()#
Return a copy of the profile with the exact same configuration.
- to_optimized()#
Convert the profile to a platform-specific optimized profile.
- Returns:
OptimizedProfile
– The platform-specific optimized profile built using the configuration of this profile.
- consensus#
The consensus residue line of the profile, if set.
Added in version 0.4.1.
- consensus_structure#
The consensus structure of the profile, if any.
Added in version 0.4.1.
- cutoffs#
The bitscore cutoffs for this profile.
- Type:
Cutoffs
- evalue_parameters#
The e-value parameters for this profile.
- Type:
EvalueParameters
- class pyhmmer.plan7.OptimizedProfile#
An optimized profile that uses platform-specific instructions.
Optimized profiles store the match emissions and transition probabilities for a profile HMM so that they can be loaded in the SIMD code. Typically, matrices use aligned storage so that they can loaded efficiently, and are striped à la Farrar to compute pairwise scores for each sequence residue and profile node.
References
Farrar, Michael. Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
- __copy__()#
None
- __deepcopy__(memo)#
None
- __init__(M, alphabet)#
Create a new optimized profile from scratch.
Once allocated, you must call the
convert
method with aProfile
object. It’s actually easier to useProfile.to_optimized
method to obtained a configuredOptimizedProfile
directly, unless you’re explicitly trying to recycle memory.
- __sizeof__()#
None
- convert(profile)#
Store the given profile into
self
as a platform-specific profile.Use this method to configure an optimized profile from a
Profile
while recycling the internal vector buffers.- Raises:
ValueError – When the optimized profile is too small to hold the profile, or when the standard and the optimized profiles are not compatible.
See also
The
Profile.to_optimized
method, which allows getting anOptimizedProfile
directly from a profile without having to allocate first.
- copy()#
Create an exact copy of the optimized profile.
Note
The
Alphabet
referenced to by this object is not copied, usecopy.deepcopy
if this is the intended behaviour.
- ssv_filter(seq)#
Compute the SSV filter score for the given sequence.
- Parameters:
seq (
DigitalSequence
) – The sequence in digital format for which to compute the SSV filter score.- Returns:
Note
- Raises:
AlphabetMismatch – When the alphabet of the sequence does not correspond to the profile alphabet.
Caution
This method is not available on the PowerPC platform (calling it will raise a
NotImplementedError
).Added in version 0.4.0.
- write(fh_filter, fh_profile)#
Write an optimized profile to two separate files.
HMMER implements an acceleration pipeline using several scoring algorithms. Parameters for MSV (the Multi ungapped Segment Viterbi) are saved independently to the
fh_filter
handle, while the rest of the profile is saved tofh_profile
.
- bias_b#
The positive bias to emission scores.
Added in version 0.4.0.
Changed in version 0.10.3: Renamed from
bias
.- Type:
- consensus#
The consensus residue line of the profile, if any.
Added in version 0.4.11.
- consensus_structure#
The consensus structure of the profile, if any.
Added in version 0.4.11.
- cutoffs#
The bitscore cutoffs for this profile.
- Type:
Cutoffs
- ddbound_w#
The threshold precalculated for lazy \(DD\) evaluation.
Added in version 0.10.3.
- Type:
- evalue_parameters#
The e-value parameters for this profile.
- Type:
EvalueParameters
- model_mask#
The model mask line from the alignment, if any.
Added in version 0.3.1.
- ncj_roundoff#
The missing precision on \(NN,CC,JJ\) after rounding.
Added in version 0.10.3.
- Type:
- offsets#
The disk offsets for this optimized profile.
- Type:
Offsets
- reference#
The reference line from the alignment, if any.
This is relevant if the profile was built from a multiple sequence alignment (e.g. by
Builder.build_msa
, or by an externalhmmbuild
pipeline run).Added in version 0.3.1.
- class pyhmmer.plan7.OptimizedProfileBlock#
A container for storing
OptimizedProfile
objects.This collections allows the scan loop of
Pipeline
objects to scan several target profiles without having to acquire the GIL for each new model. It does so by synchronizing a Pythonlist
storingOptimizedProfile
objects with a C-contiguous array of pointers to the underlying struct of each object.Added in version 0.7.0.
- __copy__()#
None
- __init__(alphabet, iterable=())#
Create a new block from an iterable of
OptimizedProfile
objects.
- __reduce__()#
None
- __sizeof__()#
None
- append(optimized_profile)#
Append an optimized profile at the end of the block.
- clear()#
Remove all optimized profiles from the block.
- copy()#
Return a copy of the optimized profile block.
Note
The optimized profiles internally refered to by this collection are not copied. Use
copy.deepcopy
if you also want to duplicate the internal storage of each optimized profile.
- extend(iterable)#
Extend block by appending optimized profiles from the iterable.
- index(optimized_profile, start=0, stop=9223372036854775807)#
Return the index of the first occurence of
optimized_profile
.- Raises:
ValueError – When the block does not contain
optimized_profile
.
- insert(index, optimized_profile)#
Insert a new optimized profile in the block before
index
.
- pop(index=-1)#
Remove and return an optimized profile from the block.
- remove(optimized_profile)#
Remove the first occurence of the given optimized profile.