Profile Hidden Markov Models#

class pyhmmer.plan7.HMM#

A data structure storing the Plan7 Hidden Markov Model.

alphabet#

The alphabet of the model.

Type:

Alphabet

Changed in version 0.4.6: Added the evalue_parameters and cutoffs attributes.

__copy__()#

None

__deepcopy__(memo)#

None

__getstate__()#

None

__init__(alphabet, M, name)#

Create a new HMM from scratch.

Parameters:
  • alphabet (Alphabet) – The alphabet of the model.

  • M (int) – The length of the model (i.e. the number of nodes).

  • name (bytes) – The name of the model.

__reduce__()#

None

__setstate__(state)#

None

__sizeof__()#

None

copy()#

Return a copy of the HMM with the exact same configuration.

Added in version 0.3.0.

match_occupancy()#

Calculate the match occupancy for each match state.

Returns:

VectorF – A vector of size \(M+1\) containing the probability that each match state is used in a sample glocal path through the model.

Added in version 0.4.10.

mean_match_entropy()#

Compute the mean entropy per HMM match state.

The mean entropy per match state emission distribution is defined as:

\[- \frac{1}{M} \sum_{k=1}^{M} \sum_x p_k(x) \log_2 p_k(x)\]

where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\).

Returns:

float – The mean entropy, in bits.

Example

>>> thioesterase.mean_match_entropy()
3.0425...

Added in version 0.4.10.

mean_match_information(background)#

Compute the mean information content of the HMM match states.

The mean information content per match state emission distribution is defined as:

\[\frac{1}{M} \sum_{k=1}^{M} \left[ \sum_x f(x) \log_2 f(x) - \sum_x p_k(x) \log_2 p_k(x) \right]\]

in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.

Parameters:

background (Background) – The null model from which to get the background emission probabilities.

Returns:

float – The mean information content, in bits.

Example

>>> background = plan7.Background(easel.Alphabet.amino())
>>> thioesterase.mean_match_information(background)
1.1330...

Added in version 0.4.10.

mean_match_relative_entropy(background)#

Compute the mean relative entropy per HMM match state.

The mean relative entropy per match state emission distribution is defined as:

\[\frac{1}{M} \sum_{k=1}^{M} \sum_x { p_k(x) \log_2 \frac{p_k(x)}{f(x)} }\]

in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.

Parameters:

background (Background) – The null model from which to get the background emission probabilities.

Returns:

float – The mean relative entropy, in bits.

Example

>>> background = plan7.Background(easel.Alphabet.amino())
>>> thioesterase.mean_match_relative_entropy(background)
1.1201...

Added in version 0.4.10.

renormalize()#

Renormalize all parameter vectors (emissions and transitions).

Added in version 0.4.0.

classmethod sample(alphabet, M, randomness, ungapped=False, enumerable=False)#

Sample an HMM of length M at random.

Parameters:
  • alphabet (Alphabet) – The alphabet of the model.

  • M (int) – The length of the model to generate (i.e. the number of nodes).

  • randomness (Randomness) – The random number generator to use for sampling.

  • ungapped (bool) – Set to True to build an ungapped HMM, i.e. an HMM where the \(M_n o M_{n+1}\) are all one and the remaining transitions are zero. Ignored when enumerable is True.

  • enumerable (bool) – Set to True to build a random HMM with no nonzero insertion transitions.

Returns:

HMM – A new HMM generated at random.

Added in version 0.7.0.

scale(scale, exponential=False)#

Rescale counts by a factor in a model containing counts.

This method only affects core probability model emissions and transitions.

Parameters:
  • scale (float) – The scaling factor to use (\(1.0\) for no scaling). Often computed using the ratio of effective sequences (\(\frac{n_{eff}}{n_{seq}}\))

  • exponential (bool) – When set to True, use scale as an exponential factor (\(C_i = C_i^s\)) instead of a multiplicative factor (\(C_i = C_i \times s\)), resulting in a non-uniform scaling across columns. This can be useful when some heavily fragmented sequences are used to reconstruct a family MSA.

Added in version 0.4.0.

set_composition()#

Calculate and set the model composition.

Added in version 0.4.0.

set_consensus(sequence=None)#

Set the model consensus sequence.

If given a sequence, the HMM is assumed to originate from a single-sequence model, and the sequence is used as the consensus. Otherwise, the consensus is computed by extracting the consensus at each position.

Residues in the consensus sequence are uppercased when their emission probabilities are above an arbitrary, alphabet-specific threshold (\(0.9\) for nucleotide alphabets, \(0.5\) for protein).

Parameters:

sequence (DigitalSequence) – A sequence in digital mode with \(M\) residues and the same alphabet as the HMM.

Raises:

Added in version 0.10.1.

to_profile(background=None, L=400, multihit=True, local=True)#

Create a new profile configured for this HMM.

This method is a shortcut for creating a new Profile and calling configure for a given HMM. Prefer manually calling configure to recycle the profile buffer when running inside a loop.

Parameters:
  • background (Background, optional) – The null background model. In None given, create a default one for the HMM alphabet.

  • L (int) – The expected target sequence length.

  • multihit (bool) – Whether or not to use multihit mode.

  • local (bool) – Whether to use local or global mode.

Added in version 0.8.0.

validate(tolerance=0.0001)#

Validate the HMM agains the Plan7 structural constraints.

HMMs created with PyHMMER are always valid, whether they come from a Builder or from the HMM constructor. However, it is possible to manually edit the emission scores and transition probabilities, and the structural constrains may not hold. Call this method to make sure the HMM is valid after you are done editing it.

Parameters:

tolerance (float) – The tolerance with which to check that the transition probabilities sum to 1.

Raises:

ValueError – When the HMM fails validation.

Added in version 0.8.1.

write(fh, binary=False)#

Write the HMM to a file handle.

Parameters:
  • fh (io.IOBase) – A Python file handle, opened in binary mode (this must be the case even with binary=False, since the C code will emit bytes in either case).

  • binary (bool) – Pass False to emit the file in ASCII mode using the latest supported HMMER format, or True to use the binary HMMER3 format.

zero()#

Set all parameters to zero, including model composition.

M#

The length of the model (i.e. the number of nodes).

Type:

int

accession#

The accession of the HMM, if any.

Type:

bytes or None

checksum#

The 32-bit checksum of the HMM, if any.

The checksum if calculated from the alignment the HMM was created from, and was introduced in more recent HMM formats. This means some HMM objects may have a non-None checksum.

Added in version 0.2.1.

Changed in version 0.3.1: Returns None if the HMM flag for the checksum is not set.

Type:

int or None

command_line#

The command line that built the model.

For HMMs created with Builder, this defaults to sys.argv. It can however be set to any string, including multiline to show successive commands.

Example

>>> print(thioesterase.command_line)
hmmbuild Thioesterase.hmm Thioesterase.fa
hmmcalibrate Thioesterase.hmm

Added in version 0.3.1.

Type:

str or None

composition#

The model composition.

May not be available for freshly-created HMMs. To get the mean residue composition emitted by the model, the set_composition method must be called to compute the composition from occupancy.

Note

Although the allocated buffer in the P7_HMM object is always the same dimension so that it can store the largest possible alphabet, we only expose the relevant residues. This means that the vector will be of size alphabet.K:

>>> dna = easel.Alphabet.dna()
>>> dna.K
4
>>> hmm = plan7.HMM(dna, 100, b"test")
>>> hmm.set_composition()
>>> len(hmm.composition)
4

Added in version 0.4.0.

Type:

VectorF or None

consensus#

The consensus residue line of the HMM, if set.

Added in version 0.3.0.

Type:

str or None

consensus_accessibility#

The consensus accessibility of the HMM, if any.

Added in version 0.3.1.

Type:

str or None

consensus_structure#

The consensus structure of the HMM, if any.

Added in version 0.3.1.

Type:

str or None

creation_time#

The creation time of the HMM, if any.

Example

Get the creation time for any HMM:

>>> thioesterase.creation_time
datetime.datetime(2008, 11, 25, 17, 28, 32)

Set the creation time manually to a different date and time:

>>> ctime = datetime.datetime(2021, 8, 23, 23, 59, 19)
>>> thioesterase.creation_time = ctime
>>> thioesterase.creation_time
datetime.datetime(2021, 8, 23, 23, 59, 19)

Danger

Internally, libhmmer always uses asctime to generate a timestamp for the HMMs, so this property assumes that every creation time field can be parsed into a datetime.datetime object using the "%a %b %d %H:%M:%S %Y" format.

Added in version 0.4.6.

Type:

datetime.datetime or None

cutoffs#

The bitscore cutoffs for this HMM.

Type:

Cutoffs

description#

The description of the HMM, if any.

Type:

bytes or None

evalue_parameters#

The e-value parameters for this HMM.

Type:

EvalueParameters

insert_emissions#

The insert emissions of the model.

The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.

Caution

If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1. Consider calling HMM.validate after manual edition.

Added in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF, and stores row 0.

Type:

MatrixF

match_emissions#

The match emissions of the model.

The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.

Note

Since the first row corresponds to the entry probabilities, the emissions are unused. By convention, it should still contain valid probabilities, so it will always be set as follow with 1 probability for the first symbol, and 0 for the rest:

>>> hmm = HMM(easel.Alphabet.dna(), 100, b"test")
>>> hmm.match_emissions
MatrixF([[1.0, 0.0, 0.0, 0.0], ...])

Caution

If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1. Consider calling HMM.validate after manual edition.

Added in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF, and stores row 0.

Type:

MatrixF

model_mask#

The model mask line from the alignment, if any.

Added in version 0.3.1.

Type:

str or None

name#

The name of the HMM.

Type:

bytes

nseq#

The number of training sequences used, if any.

If the HMM was created from a multiple sequence alignment, this corresponds to the number of sequences in the MSA.

Example

>>> thioesterase.nseq
278

Added in version 0.3.1.

Type:

int or None

nseq_effective#

The number of effective sequences used, if any.

Added in version 0.3.1.

Type:

float or None

reference#

The reference line from the alignment, if any.

This is relevant if the HMM was built from a multiple sequence alignment (e.g. by Builder.build_msa, or by an external hmmbuild pipeline run).

Added in version 0.3.1.

Type:

str or None

transition_probabilities#

The transition probabilities of the model.

The property exposes a matrix of shape \((M+1, 7)\), with one row per node (plus one extra row for the entry probabilities), and one column per transition.

Columns correspond to the following transitions, in order: \(M_n \to M_{n+1}\), \(M_n \to I_n\), \(M_n \to D_{n+1}\), \(I_n \to I_n\), \(I_n \to M_{n+1}\), \(D_n \to D_{n+1}\), \(D_n \to M_{n+1}\). Use the pyhmmer.plan7.Transitions enum instead of hardcoded indices to make your code more legible.

Example

>>> t = thioesterase.transition_probabilities
>>> t[1, Transitions.MM]
0.999...
>>> t[0, Transitions.DM]  # 1 by convention for the first node
1.0
>>> t[-1, Transitions.MD] # 0 by convention for the last node
0.0

Caution

If editing this matrix manually, note that some invariants need to hold for the HMM to be valid: \(I_n\), \(M_n\) and \(D_n\) transition probabilities should only contain probabilities between 0 and 1, and sum to 1:

>>> t = thioesterase.transition_probabilities
>>> t[50, Transitions.MM] + t[50, Transitions.MI] + t[50, Transitions.MD]
1.000...
>>> t[50, Transitions.IM] + t[50, Transitions.II]
1.000...
>>> t[50, Transitions.DM] + t[50, Transitions.DD]
1.000...

Consider calling HMM.validate after manual edition.

Added in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF.

Type:

MatrixF

class pyhmmer.plan7.Profile#

A Plan7 search profile.

alphabet#

The alphabet for which this profile was configured.

Type:

Alphabet

Changed in version 0.4.6: Added the evalue_parameters and cutoffs attributes.

__copy__()#

None

__deepcopy__(memo)#

None

__init__(M, alphabet)#

Create a new profile for the given alphabet.

Parameters:
  • M (int) – The length of the profile, i.e. the number of nodes.

  • alphabet (Alphabet) – The alphabet to use with this profile.

__sizeof__()#

None

clear()#

Clear internal buffers to reuse the profile without reallocation.

configure(hmm, background, L=400, multihit=True, local=True)#

Configure a search profile using the given models.

Parameters:
  • hmm (HMM) – The model HMM with core probabilities.

  • background (Background) – The null background model.

  • L (int) – The expected target sequence length.

  • multihit (bool) – Whether or not to use multihit modes.

  • local (bool) – Whether or not to use non-local modes.

copy()#

Return a copy of the profile with the exact same configuration.

to_optimized()#

Convert the profile to a platform-specific optimized profile.

Returns:

OptimizedProfile – The platform-specific optimized profile built using the configuration of this profile.

L#

The current configured target sequence length.

Type:

int

M#

The length of the profile (i.e. the number of nodes).

Added in version 0.3.0.

Type:

int

accession#

The accession of the profile, if any.

Added in version 0.3.0.

Type:

bytes or None

consensus#

The consensus residue line of the profile, if set.

Added in version 0.4.1.

Type:

str or None

consensus_structure#

The consensus structure of the profile, if any.

Added in version 0.4.1.

Type:

str or None

cutoffs#

The bitscore cutoffs for this profile.

Type:

Cutoffs

description#

The description of the profile, if any.

Added in version 0.3.0.

Type:

bytes or None

evalue_parameters#

The e-value parameters for this profile.

Type:

EvalueParameters

local#

Whether the profile is in local mode.

Added in version 0.7.0.

Type:

bool

multihit#

Whether the profile is in multihit mode.

Added in version 0.7.0.

Type:

bool

name#

The name of the profile, if any.

Added in version 0.3.0.

Type:

bytes or None

offsets#

The disk offsets for this profile.

Type:

Offsets

class pyhmmer.plan7.OptimizedProfile#

An optimized profile that uses platform-specific instructions.

Optimized profiles store the match emissions and transition probabilities for a profile HMM so that they can be loaded in the SIMD code. Typically, matrices use aligned storage so that they can loaded efficiently, and are striped à la Farrar to compute pairwise scores for each sequence residue and profile node.

alphabet#

The alphabet for which this optimized profile is configured.

Type:

Alphabet

References

  • Farrar, Michael. Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.

__copy__()#

None

__deepcopy__(memo)#

None

__init__(M, alphabet)#

Create a new optimized profile from scratch.

Once allocated, you must call the convert method with a Profile object. It’s actually easier to use Profile.to_optimized method to obtained a configured OptimizedProfile directly, unless you’re explicitly trying to recycle memory.

Parameters:
  • M (int) – The length of the model (i.e. the number of nodes).

  • alphabet (Alphabet) – The alphabet of the model.

__sizeof__()#

None

convert(profile)#

Store the given profile into self as a platform-specific profile.

Use this method to configure an optimized profile from a Profile while recycling the internal vector buffers.

Raises:

ValueError – When the optimized profile is too small to hold the profile, or when the standard and the optimized profiles are not compatible.

See also

The Profile.to_optimized method, which allows getting an OptimizedProfile directly from a profile without having to allocate first.

copy()#

Create an exact copy of the optimized profile.

Note

The Alphabet referenced to by this object is not copied, use copy.deepcopy if this is the intended behaviour.

ssv_filter(seq)#

Compute the SSV filter score for the given sequence.

Parameters:

seq (DigitalSequence) – The sequence in digital format for which to compute the SSV filter score.

Returns:

float or None – The SSV filter score for the sequence.

Note

  • math.inf may be returned if an overflow occurs that will also occur in the MSV filter. This is the case whenever \(\text{base} - \text{tjb} - \text{tbm} \ge 128\)

  • None may be returned if the MSV filter score needs to be recomputed (because it may not overflow even though the SSV filter did).

Raises:

AlphabetMismatch – When the alphabet of the sequence does not correspond to the profile alphabet.

Caution

This method is not available on the PowerPC platform (calling it will raise a NotImplementedError).

Added in version 0.4.0.

write(fh_filter, fh_profile)#

Write an optimized profile to two separate files.

HMMER implements an acceleration pipeline using several scoring algorithms. Parameters for MSV (the Multi ungapped Segment Viterbi) are saved independently to the fh_filter handle, while the rest of the profile is saved to fh_profile.

L#

The currently configured target sequence length.

Added in version 0.4.0.

Type:

int

M#

The number of nodes in the model.

Added in version 0.4.0.

Type:

int

accession#

The accession of the profile, if any.

Added in version 0.4.11.

Type:

bytes or None

base_b#

The offset for MSV filter scores.

Added in version 0.10.3.

Type:

int

base_w#

The offset for Viterbi filter scores.

Added in version 0.10.3.

Type:

int

bias_b#

The positive bias to emission scores.

Added in version 0.4.0.

Changed in version 0.10.3: Renamed from bias.

Type:

int

compositions#

The per-model HMM filter composition.

Added in version 0.11.0.

Type:

VectorF

consensus#

The consensus residue line of the profile, if any.

Added in version 0.4.11.

Type:

str or None

consensus_structure#

The consensus structure of the profile, if any.

Added in version 0.4.11.

Type:

str or None

cutoffs#

The bitscore cutoffs for this profile.

Type:

Cutoffs

ddbound_w#

The threshold precalculated for lazy \(DD\) evaluation.

Added in version 0.10.3.

Type:

int

description#

The description of the profile, if any.

Added in version 0.4.11.

Type:

bytes or None

evalue_parameters#

The e-value parameters for this profile.

Type:

EvalueParameters

local#

Whether the optimized profile is in local mode.

Added in version 0.7.0.

Type:

bool

model_mask#

The model mask line from the alignment, if any.

Added in version 0.3.1.

Type:

str or None

multihit#

Whether the optimized is in multihit mode.

Added in version 0.7.0.

Type:

bool

name#

The name of the profile, if any.

Added in version 0.4.11.

Type:

bytes or None

ncj_roundoff#

The missing precision on \(NN,CC,JJ\) after rounding.

Added in version 0.10.3.

Type:

float

offsets#

The disk offsets for this optimized profile.

Type:

Offsets

rbv#

The match scores for the MSV filter.

Type:

MatrixU8

reference#

The reference line from the alignment, if any.

This is relevant if the profile was built from a multiple sequence alignment (e.g. by Builder.build_msa, or by an external hmmbuild pipeline run).

Added in version 0.3.1.

Type:

str or None

rfv#

The match scores for the Forward/Backward.

Added in version 0.10.3.

Type:

MatrixF

sbv#

The match scores for the SSV filter.

Added in version 0.4.0.

Type:

MatrixU8

scale_b#

The scale for MSV filter scores.

Added in version 0.10.3.

Type:

float

scale_w#

The scale for Viterbi filter scores.

Added in version 0.10.3.

Type:

float

tbm#

The constant cost for a \(B \to M_k\) transition.

Added in version 0.4.0.

Type:

int

tec#

The constant cost for a \(E \to C\) transition.

Added in version 0.4.0.

Type:

int

tfv#

The transition scores for the Forward/Backard.

Added in version 0.10.3.

Type:

VectorF

tjb#

The constant cost for a \(NCJ\) move.

Added in version 0.4.0.

Type:

int

xf#

The \(NECJ\) transition costs.

Added in version 0.10.3.

Type:

MatrixF

class pyhmmer.plan7.OptimizedProfileBlock#

A container for storing OptimizedProfile objects.

This collections allows the scan loop of Pipeline objects to scan several target profiles without having to acquire the GIL for each new model. It does so by synchronizing a Python list storing OptimizedProfile objects with a C-contiguous array of pointers to the underlying struct of each object.

Added in version 0.7.0.

__copy__()#

None

__init__(alphabet, iterable=())#

Create a new block from an iterable of OptimizedProfile objects.

__reduce__()#

None

__sizeof__()#

None

append(optimized_profile)#

Append an optimized profile at the end of the block.

clear()#

Remove all optimized profiles from the block.

copy()#

Return a copy of the optimized profile block.

Note

The optimized profiles internally refered to by this collection are not copied. Use copy.deepcopy if you also want to duplicate the internal storage of each optimized profile.

extend(iterable)#

Extend block by appending optimized profiles from the iterable.

index(optimized_profile, start=0, stop=9223372036854775807)#

Return the index of the first occurence of optimized_profile.

Raises:

ValueError – When the block does not contain optimized_profile.

insert(index, optimized_profile)#

Insert a new optimized profile in the block before index.

pop(index=-1)#

Remove and return an optimized profile from the block.

remove(optimized_profile)#

Remove the first occurence of the given optimized profile.