Plan7

High-level interface to the Plan7 data model.

Plan7 is the model architecture used by HMMER since HMMER2.

See also

Details about the Plan 7 architecture in the HMMER documentation.

Profile HMMs

HMM

class pyhmmer.plan7.HMM

A data structure storing the Plan7 Hidden Markov Model.

alphabet

The alphabet of the model.

Type:

Alphabet

Changed in version 0.4.6: Added the evalue_parameters and cutoffs attributes.

__copy__()

None

__deepcopy__(memo)

None

__getstate__()

None

__init__()

Create a new HMM from scratch.

Parameters:
  • alphabet (Alphabet) – The alphabet of the model.

  • M (int) – The length of the model (i.e. the number of nodes).

  • name (bytes) – The name of the model.

__reduce__()

None

__setstate__(state)

None

__sizeof__()

None

copy()

Return a copy of the HMM with the exact same configuration.

Added in version 0.3.0.

match_occupancy()

Calculate the match occupancy for each match state.

Returns:

VectorF – A vector of size \(M+1\) containing the probability that each match state is used in a sample glocal path through the model.

Added in version 0.4.10.

mean_match_entropy()

Compute the mean entropy per HMM match state.

The mean entropy per match state emission distribution is defined as:

\[- \frac{1}{M} \sum_{k=1}^{M} \sum_x p_k(x) \log_2 p_k(x)\]

where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\).

Returns:

float – The mean entropy, in bits.

Example

>>> thioesterase.mean_match_entropy()
3.0425...

Added in version 0.4.10.

mean_match_information(background)

Compute the mean information content of the HMM match states.

The mean information content per match state emission distribution is defined as:

\[\frac{1}{M} \sum_{k=1}^{M} \left[ \sum_x f(x) \log_2 f(x) - \sum_x p_k(x) \log_2 p_k(x) \right]\]

in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.

Parameters:

background (Background) – The null model from which to get the background emission probabilities.

Returns:

float – The mean information content, in bits.

Example

>>> background = plan7.Background(easel.Alphabet.amino())
>>> thioesterase.mean_match_information(background)
1.1330...

Added in version 0.4.10.

mean_match_relative_entropy(background)

Compute the mean relative entropy per HMM match state.

The mean relative entropy per match state emission distribution is defined as:

\[\frac{1}{M} \sum_{k=1}^{M} \sum_x { p_k(x) \log_2 \frac{p_k(x)}{f(x)} }\]

in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.

Parameters:

background (Background) – The null model from which to get the background emission probabilities.

Returns:

float – The mean relative entropy, in bits.

Example

>>> background = plan7.Background(easel.Alphabet.amino())
>>> thioesterase.mean_match_relative_entropy(background)
1.1201...

Added in version 0.4.10.

renormalize()

Renormalize all parameter vectors (emissions and transitions).

Added in version 0.4.0.

classmethod sample(alphabet, M, randomness, ungapped=False, enumerable=False)

Sample an HMM of length M at random.

Parameters:
  • alphabet (Alphabet) – The alphabet of the model.

  • M (int) – The length of the model to generate (i.e. the number of nodes).

  • randomness (Randomness) – The random number generator to use for sampling.

  • ungapped (bool) – Set to True to build an ungapped HMM, i.e. an HMM where the \(M_n o M_{n+1}\) are all one and the remaining transitions are zero. Ignored when enumerable is True.

  • enumerable (bool) – Set to True to build a random HMM with no nonzero insertion transitions.

Returns:

HMM – A new HMM generated at random.

Added in version 0.7.0.

scale(scale, exponential=False)

Rescale counts by a factor in a model containing counts.

This method only affects core probability model emissions and transitions.

Parameters:
  • scale (float) – The scaling factor to use (\(1.0\) for no scaling). Often computed using the ratio of effective sequences (\(\frac{n_{eff}}{n_{seq}}\))

  • exponential (bool) – When set to True, use scale as an exponential factor (\(C_i = C_i^s\)) instead of a multiplicative factor (\(C_i = C_i \times s\)), resulting in a non-uniform scaling across columns. This can be useful when some heavily fragmented sequences are used to reconstruct a family MSA.

Added in version 0.4.0.

set_composition()

Calculate and set the model composition.

Added in version 0.4.0.

set_consensus(sequence=None)

Set the model consensus sequence.

If given a sequence, the HMM is assumed to originate from a single-sequence model, and the sequence is used as the consensus. Otherwise, the consensus is computed by extracting the consensus at each position.

Residues in the consensus sequence are uppercased when their emission probabilities are above an arbitrary, alphabet-specific threshold (\(0.9\) for nucleotide alphabets, \(0.5\) for protein).

Parameters:

sequence (DigitalSequence) – A sequence in digital mode with \(M\) residues and the same alphabet as the HMM.

Raises:

Added in version 0.10.1.

to_profile(background=None, L=400, multihit=True, local=True)

Create a new profile configured for this HMM.

This method is a shortcut for creating a new Profile and calling configure for a given HMM. Prefer manually calling configure to recycle the profile buffer when running inside a loop.

Parameters:
  • background (Background, optional) – The null background model. In None given, create a default one for the HMM alphabet.

  • L (int) – The expected target sequence length.

  • multihit (bool) – Whether or not to use multihit mode.

  • local (bool) – Whether to use local or global mode.

Added in version 0.8.0.

validate(tolerance=0.0001)

Validate the HMM agains the Plan7 structural constraints.

HMMs created with PyHMMER are always valid, whether they come from a Builder or from the HMM constructor. However, it is possible to manually edit the emission scores and transition probabilities, and the structural constrains may not hold. Call this method to make sure the HMM is valid after you are done editing it.

Parameters:

tolerance (float) – The tolerance with which to check that the transition probabilities sum to 1.

Raises:

ValueError – When the HMM fails validation.

Added in version 0.8.1.

write(fh, binary=False)

Write the HMM to a file handle.

Parameters:
  • fh (io.IOBase) – A Python file handle, opened in binary mode (this must be the case even with binary=False, since the C code will emit bytes in either case).

  • binary (bool) – Pass False to emit the file in ASCII mode using the latest supported HMMER format, or True to use the binary HMMER3 format.

zero()

Set all parameters to zero, including model composition.

M

The length of the model (i.e. the number of nodes).

Type:

int

accession

The accession of the HMM, if any.

Type:

bytes or None

checksum

The 32-bit checksum of the HMM, if any.

The checksum if calculated from the alignment the HMM was created from, and was introduced in more recent HMM formats. This means some HMM objects may have a non-None checksum.

Added in version 0.2.1.

Changed in version 0.3.1: Returns None if the HMM flag for the checksum is not set.

Type:

int or None

command_line

The command line that built the model.

For HMMs created with Builder, this defaults to sys.argv. It can however be set to any string, including multiline to show successive commands.

Example

>>> print(thioesterase.command_line)
hmmbuild Thioesterase.hmm Thioesterase.fa
hmmcalibrate Thioesterase.hmm

Added in version 0.3.1.

Type:

str or None

composition

The model composition.

May not be available for freshly-created HMMs. To get the mean residue composition emitted by the model, the set_composition method must be called to compute the composition from occupancy.

Note

Although the allocated buffer in the P7_HMM object is always the same dimension so that it can store the largest possible alphabet, we only expose the relevant residues. This means that the vector will be of size alphabet.K:

>>> dna = easel.Alphabet.dna()
>>> dna.K
4
>>> hmm = plan7.HMM(dna, 100, b"test")
>>> hmm.set_composition()
>>> len(hmm.composition)
4

Added in version 0.4.0.

Type:

VectorF or None

consensus

The consensus residue line of the HMM, if set.

Added in version 0.3.0.

Type:

str or None

consensus_accessibility

The consensus accessibility of the HMM, if any.

Added in version 0.3.1.

Type:

str or None

consensus_structure

The consensus structure of the HMM, if any.

Added in version 0.3.1.

Type:

str or None

creation_time

The creation time of the HMM, if any.

Example

Get the creation time for any HMM:

>>> thioesterase.creation_time
datetime.datetime(2008, 11, 25, 17, 28, 32)

Set the creation time manually to a different date and time:

>>> ctime = datetime.datetime(2021, 8, 23, 23, 59, 19)
>>> thioesterase.creation_time = ctime
>>> thioesterase.creation_time
datetime.datetime(2021, 8, 23, 23, 59, 19)

Danger

Internally, libhmmer always uses asctime to generate a timestamp for the HMMs, so this property assumes that every creation time field can be parsed into a datetime.datetime object using the "%a %b %d %H:%M:%S %Y" format.

Added in version 0.4.6.

Type:

datetime.datetime or None

cutoffs

The bitscore cutoffs for this HMM.

Type:

Cutoffs

description

The description of the HMM, if any.

Type:

bytes or None

evalue_parameters

The e-value parameters for this HMM.

Type:

EvalueParameters

insert_emissions

The insert emissions of the model.

The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.

Caution

If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1. Consider calling HMM.validate after manual edition.

Added in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF, and stores row 0.

Type:

MatrixF

match_emissions

The match emissions of the model.

The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.

Note

Since the first row corresponds to the entry probabilities, the emissions are unused. By convention, it should still contain valid probabilities, so it will always be set as follow with 1 probability for the first symbol, and 0 for the rest:

>>> hmm = HMM(easel.Alphabet.dna(), 100, b"test")
>>> hmm.match_emissions
MatrixF([[1.0, 0.0, 0.0, 0.0], ...])

Caution

If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1. Consider calling HMM.validate after manual edition.

Added in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF, and stores row 0.

Type:

MatrixF

model_mask

The model mask line from the alignment, if any.

Added in version 0.3.1.

Type:

str or None

name

The name of the HMM.

Type:

bytes

nseq

The number of training sequences used, if any.

If the HMM was created from a multiple sequence alignment, this corresponds to the number of sequences in the MSA.

Example

>>> thioesterase.nseq
278

Added in version 0.3.1.

Type:

int or None

nseq_effective

The number of effective sequences used, if any.

Added in version 0.3.1.

Type:

float or None

reference

The reference line from the alignment, if any.

This is relevant if the HMM was built from a multiple sequence alignment (e.g. by Builder.build_msa, or by an external hmmbuild pipeline run).

Added in version 0.3.1.

Type:

str or None

transition_probabilities

The transition probabilities of the model.

The property exposes a matrix of shape \((M+1, 7)\), with one row per node (plus one extra row for the entry probabilities), and one column per transition.

Columns correspond to the following transitions, in order: \(M_n \to M_{n+1}\), \(M_n \to I_n\), \(M_n \to D_{n+1}\), \(I_n \to I_n\), \(I_n \to M_{n+1}\), \(D_n \to D_{n+1}\), \(D_n \to M_{n+1}\). Use the Transitions enum instead of hardcoded indices to make your code more legible.

Example

>>> t = thioesterase.transition_probabilities
>>> t[1, Transitions.MM]
0.999...
>>> t[0, Transitions.DM]  # 1 by convention for the first node
1.0
>>> t[-1, Transitions.MD] # 0 by convention for the last node
0.0

Caution

If editing this matrix manually, note that some invariants need to hold for the HMM to be valid: \(I_n\), \(M_n\) and \(D_n\) transition probabilities should only contain probabilities between 0 and 1, and sum to 1:

>>> t = thioesterase.transition_probabilities
>>> t[50, Transitions.MM] + t[50, Transitions.MI] + t[50, Transitions.MD]
1.000...
>>> t[50, Transitions.IM] + t[50, Transitions.II]
1.000...
>>> t[50, Transitions.DM] + t[50, Transitions.DD]
1.000...

Consider calling HMM.validate after manual edition.

Added in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF.

Type:

MatrixF

Profile

class pyhmmer.plan7.Profile

A Plan7 search profile.

alphabet

The alphabet for which this profile was configured.

Type:

Alphabet

Changed in version 0.4.6: Added the evalue_parameters and cutoffs attributes.

__copy__()

None

__deepcopy__(memo)

None

__init__()

Create a new profile for the given alphabet.

Parameters:
  • M (int) – The length of the profile, i.e. the number of nodes.

  • alphabet (Alphabet) – The alphabet to use with this profile.

__sizeof__()

None

clear()

Clear internal buffers to reuse the profile without reallocation.

configure(hmm, background, L=400, multihit=True, local=True)

Configure a search profile using the given models.

Parameters:
  • hmm (HMM) – The model HMM with core probabilities.

  • background (Background) – The null background model.

  • L (int) – The expected target sequence length.

  • multihit (bool) – Whether or not to use multihit modes.

  • local (bool) – Whether or not to use non-local modes.

copy()

Return a copy of the profile with the exact same configuration.

to_optimized()

Convert the profile to a platform-specific optimized profile.

Returns:

OptimizedProfile – The platform-specific optimized profile built using the configuration of this profile.

L

The current configured target sequence length.

Type:

int

M

The length of the profile (i.e. the number of nodes).

Added in version 0.3.0.

Type:

int

accession

The accession of the profile, if any.

Added in version 0.3.0.

Type:

bytes or None

consensus

The consensus residue line of the profile, if set.

Added in version 0.4.1.

Type:

str or None

consensus_structure

The consensus structure of the profile, if any.

Added in version 0.4.1.

Type:

str or None

cutoffs

The bitscore cutoffs for this profile.

Type:

Cutoffs

description

The description of the profile, if any.

Added in version 0.3.0.

Type:

bytes or None

evalue_parameters

The e-value parameters for this profile.

Type:

EvalueParameters

local

Whether the profile is in local mode.

Added in version 0.7.0.

Type:

bool

multihit

Whether the profile is in multihit mode.

Added in version 0.7.0.

Type:

bool

name

The name of the profile, if any.

Added in version 0.3.0.

Type:

bytes or None

offsets

The disk offsets for this profile.

Type:

Offsets

OptimizedProfile

class pyhmmer.plan7.OptimizedProfile

An optimized profile that uses platform-specific instructions.

Optimized profiles store the match emissions and transition probabilities for a profile HMM so that they can be loaded in the SIMD code. Typically, matrices use aligned storage so that they can loaded efficiently, and are striped à la Farrar to compute pairwise scores for each sequence residue and profile node.

alphabet

The alphabet for which this optimized profile is configured.

Type:

Alphabet

References

  • Farrar, Michael. Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.

__copy__()

None

__deepcopy__(memo)

None

__init__()

Create a new optimized profile from scratch.

Once allocated, you must call the convert method with a Profile object. It’s actually easier to use Profile.to_optimized method to obtained a configured OptimizedProfile directly, unless you’re explicitly trying to recycle memory.

Parameters:
  • M (int) – The length of the model (i.e. the number of nodes).

  • alphabet (Alphabet) – The alphabet of the model.

__sizeof__()

None

convert(profile)

Store the given profile into self as a platform-specific profile.

Use this method to configure an optimized profile from a Profile while recycling the internal vector buffers.

Raises:

ValueError – When the optimized profile is too small to hold the profile, or when the standard and the optimized profiles are not compatible.

See also

The Profile.to_optimized method, which allows getting an OptimizedProfile directly from a profile without having to allocate first.

copy()

Create an exact copy of the optimized profile.

Note

The Alphabet referenced to by this object is not copied, use copy.deepcopy if this is the intended behaviour.

ssv_filter(seq)

Compute the SSV filter score for the given sequence.

Parameters:

seq (DigitalSequence) – The sequence in digital format for which to compute the SSV filter score.

Returns:

float or None – The SSV filter score for the sequence.

Note

  • math.inf may be returned if an overflow occurs that will also occur in the MSV filter. This is the case whenever \(\text{base} - \text{tjb} - \text{tbm} \ge 128\)

  • None may be returned if the MSV filter score needs to be recomputed (because it may not overflow even though the SSV filter did).

Raises:

AlphabetMismatch – When the alphabet of the sequence does not correspond to the profile alphabet.

Caution

This method is not available on the PowerPC platform (calling it will raise a NotImplementedError).

Added in version 0.4.0.

write(fh_filter, fh_profile)

Write an optimized profile to two separate files.

HMMER implements an acceleration pipeline using several scoring algorithms. Parameters for MSV (the Multi ungapped Segment Viterbi) are saved independently to the fh_filter handle, while the rest of the profile is saved to fh_profile.

L

The currently configured target sequence length.

Added in version 0.4.0.

Type:

int

M

The number of nodes in the model.

Added in version 0.4.0.

Type:

int

accession

The accession of the profile, if any.

Added in version 0.4.11.

Type:

bytes or None

base_b

The offset for MSV filter scores.

Added in version 0.10.3.

Type:

int

base_w

The offset for Viterbi filter scores.

Added in version 0.10.3.

Type:

int

bias_b

The positive bias to emission scores.

Added in version 0.4.0.

Changed in version 0.10.3: Renamed from bias.

Type:

int

compositions

The per-model HMM filter composition.

Added in version 0.11.0.

Type:

VectorF

consensus

The consensus residue line of the profile, if any.

Added in version 0.4.11.

Type:

str or None

consensus_structure

The consensus structure of the profile, if any.

Added in version 0.4.11.

Type:

str or None

cutoffs

The bitscore cutoffs for this profile.

Type:

Cutoffs

ddbound_w

The threshold precalculated for lazy \(DD\) evaluation.

Added in version 0.10.3.

Type:

int

description

The description of the profile, if any.

Added in version 0.4.11.

Type:

bytes or None

evalue_parameters

The e-value parameters for this profile.

Type:

EvalueParameters

local

Whether the optimized profile is in local mode.

Added in version 0.7.0.

Type:

bool

model_mask

The model mask line from the alignment, if any.

Added in version 0.3.1.

Type:

str or None

multihit

Whether the optimized is in multihit mode.

Added in version 0.7.0.

Type:

bool

name

The name of the profile, if any.

Added in version 0.4.11.

Type:

bytes or None

ncj_roundoff

The missing precision on \(NN,CC,JJ\) after rounding.

Added in version 0.10.3.

Type:

float

offsets

The disk offsets for this optimized profile.

Type:

Offsets

rbv

The match scores for the MSV filter.

Type:

MatrixU8

reference

The reference line from the alignment, if any.

This is relevant if the profile was built from a multiple sequence alignment (e.g. by Builder.build_msa, or by an external hmmbuild pipeline run).

Added in version 0.3.1.

Type:

str or None

rfv

The match scores for the Forward/Backward.

Added in version 0.10.3.

Type:

MatrixF

sbv

The match scores for the SSV filter.

Added in version 0.4.0.

Type:

MatrixU8

scale_b

The scale for MSV filter scores.

Added in version 0.10.3.

Type:

float

scale_w

The scale for Viterbi filter scores.

Added in version 0.10.3.

Type:

float

tbm

The constant cost for a \(B \to M_k\) transition.

Added in version 0.4.0.

Type:

int

tec

The constant cost for a \(E \to C\) transition.

Added in version 0.4.0.

Type:

int

tfv

The transition scores for the Forward/Backard.

Added in version 0.10.3.

Type:

VectorF

tjb

The constant cost for a \(NCJ\) move.

Added in version 0.4.0.

Type:

int

xf

The \(NECJ\) transition costs.

Added in version 0.10.3.

Type:

MatrixF

OptimizedProfileBlock

class pyhmmer.plan7.OptimizedProfileBlock

A container for storing OptimizedProfile objects.

This collections allows the scan loop of Pipeline objects to scan several target profiles without having to acquire the GIL for each new model. It does so by synchronizing a Python list storing OptimizedProfile objects with a C-contiguous array of pointers to the underlying struct of each object.

Added in version 0.7.0.

__copy__()

None

__init__(*args, **kwargs)
__reduce__()

None

__sizeof__()

None

append(optimized_profile)

Append an optimized profile at the end of the block.

clear()

Remove all optimized profiles from the block.

copy()

Return a copy of the optimized profile block.

Note

The optimized profiles internally refered to by this collection are not copied. Use copy.deepcopy if you also want to duplicate the internal storage of each optimized profile.

extend(iterable)

Extend block by appending optimized profiles from the iterable.

index(optimized_profile, start=0, stop=9223372036854775807)

Return the index of the first occurence of optimized_profile.

Raises:

ValueError – When the block does not contain optimized_profile.

insert(index, optimized_profile)

Insert a new optimized profile in the block before index.

pop(index=-1)

Remove and return an optimized profile from the block.

remove(optimized_profile)

Remove the first occurence of the given optimized profile.

HMM Files

HMMFile

class pyhmmer.plan7.HMMFile

A wrapper around a file (or database), storing serialized HMMs.

You can use this class to iterate on the HMMs inside a file, loading them into HMM objects.

Example

Load the first HMM from an HMM file located on the local filesystem:

>>> with HMMFile("tests/data/hmms/txt/PF02826.hmm") as hmm_file:
...     hmm = hmm_file.read()
>>> hmm.name
b'2-Hacid_dh_C'
>>> hmm.accession
b'PF02826.20'

Load all the HMMs from an HMM file into a list:

>>> with HMMFile("tests/data/hmms/txt/RREFam.hmm") as hmm_file:
...     hmms = list(hmm_file)
>>> len(hmms)
28
>>> hmms[0].accession
b'RREFam008.1'
__enter__()

None

__exit__(exc_type, exc_value, traceback)

None

__init__()

Create a new HMM reader from the given file.

Parameters:
  • file (str, bytes, os.PathLike or file-like object) – Either the path to a file containing the HMMs to read, or a file-like object in binary mode.

  • db (bool) – Set to False to force the parser to ignore the pressed HMM database if it finds one. Defaults to True.

close()

Close the HMM file and free resources.

This method has no effect if the file is already closed. It is called automatically if the HMMFile was used in a context:

>>> with HMMFile("tests/data/hmms/bin/PKSI-AT.h3m") as hmm_file:
...     hmm = hmm_file.read()
is_pressed()

Check whether the HMM file is a pressed HMM database.

A pressed database is an HMMER format to store optimized profiles in several files on the disk. It can be used to reduce the time needed to process sequences by cutting down the time needed to convert from an HMM to an OptimizedProfile.

Example

>>> HMMFile("tests/data/hmms/txt/PKSI-AT.hmm").is_pressed()
False
>>> HMMFile("tests/data/hmms/bin/PKSI-AT.h3m").is_pressed()
False
>>> HMMFile("tests/data/hmms/db/PKSI-AT.hmm").is_pressed()
True

Added in version 0.4.11.

optimized_profiles()

Get an iterator over the OptimizedProfile in the HMM database.

Returns:

HMMPressedFile – An iterator over the optimized profiles in a pressed HMM database.

Added in version 0.4.11.

read()

Read the next HMM from the file.

Returns:

HMM – The next HMM in the file, or None if all HMMs were read from the file already.

Raises:
  • ValueError – When attempting to read a HMM from a closed file, or when the file could not be parsed.

  • AllocationError – When memory for the HMM could not be allocated successfully.

Added in version 0.4.11.

rewind()

Rewind the file back to the beginning.

closed

Whether the HMMFile is closed or not.

Type:

bool

name

The path to the HMM file, if known.

Added in version 0.7.0.

Type:

str or None

HMMPressedFile

class pyhmmer.plan7.HMMPressedFile

An iterator over each OptimizedProfile in a pressed HMM database.

This class cannot be instantiated: use the HMMFile.optimized_profiles to obtain an instance from an HMMFile that wraps a pressed HMM database.

Example

Use a pressed HMM database to run a search pipeline using optimized profiles directly, instead of converting them from the text HMMs:

>>> with HMMFile("tests/data/hmms/db/Thioesterase.hmm") as hfile:
...     models = hfile.optimized_profiles()
...     hits = next(pyhmmer.hmmsearch(models, proteins))

In this example, hmmsearch will receive an iterator of OptimizedProfile instead of an iterator of HMM is hfile was passed directly; this lets hmmsearch skip the conversion step before running the search pipeline.

Added in version 0.4.11.

Changed in version 0.7.0: Allow instantiating an HMMPressedFile from a filename.

__enter__()

None

__exit__(exc_type, exc_value, traceback)

None

__init__()

Create a new pressed file from the given filename.

Parameters:

file (str, bytes or os.PathLike) – The path to the pressed HMM file containing the optimized profiles to read.

close()

Close the pressed file and free resources.

This method has no effect if the file is already closed. It is called automatically if the HMMFile was used in a context:

>>> with HMMPressedFile("tests/data/hmms/db/PKSI-AT.hmm") as hmm_db:
...     optimized_profile = hmm_db.read()
read()

Read the next optimized profile from the file.

Returns:

OptimizedProfile – The next optimized profile in the file, or None if all profiles were read from the file already.

Raises:
  • ValueError – When attempting to read an optimized profile from a closed file, or when the file could not be parsed.

  • AllocationError – When memory for the OptimizedProfile could not be allocated successfully.

Added in version 0.4.11.

rewind()

Rewind the file back to the beginning.

closed

Whether the HMMPressedFile is closed or not.

Added in version 0.7.0.

Type:

bool

name

The path to the HMM file.

Note

Unlike HMMFile.name this attribute can never be None, since a HMMPressedFile object can only be created from a filename, and not from a file-like object.

Added in version 0.7.0.

Type:

str

Pipelines

Pipeline

class pyhmmer.plan7.Pipeline

An HMMER3 accelerated sequence/profile comparison pipeline.

The Plan7 pipeline handles the platform-accelerated comparison of sequences to profile HMMs. It performs either a search (comparing a single query profile to a target sequence database) or a scan (comparing a single query sequence to a target profile database). The two methods are yielding equivalent results: if you have a collection of \(M\) sequences and \(N\) HMMs to compare, doing a search or a scan should give the same raw scores. The E-values will however be different if Z and domZ where not set manually: \(Z\) will be set to \(M\) for a search, and to \(N\) for a scan.

The main reason for which you should choose search or scan is the relative size of the sequences and HMMs databases. In the original HMMER3 code, the memory was managed in a way that you never had to load the entirety of the target sequences in memory. In PyHMMER, the methods accept both reading the target database from a file, or loading it entirely into memory. A scan is always slower than a search because of the overhead introduced when reconfiguring a profile for a new sequence.

alphabet

The alphabet for which the pipeline is configured.

Type:

Alphabet

background

The null background model to use to compute scores.

Type:

Background

randomness

The random number generator being used by the pipeline.

Type:

Randomness

Changed in version 0.4.2: Added the randomness attribute.

__init__()

Instantiate and configure a new accelerated comparison pipeline.

Parameters:
  • alphabet (Alphabet) – The biological alphabet the of the HMMs and sequences that are going to be compared. Used to build the background model.

  • background (Background, optional) – The background model to use with the pipeline, or None to create and use a default one. The pipeline needs ownership of the background model, so any background model passed there will be copied.

Keyword Arguments:
  • bias_filter (bool) – Whether or not to enable composition bias filter. Defaults to True.

  • null2 (bool) – Whether or not to compute biased composition score corrections. Defaults to True.

  • seed (int, optional) – The seed to use with the random number generator. Pass 0 to use a one-time arbitrary seed, or None to keep the default seed from HMMER.

  • Z (int, optional) – The effective number of comparisons done, for E-value calculation. Leave as None to auto-detect by counting the number of sequences queried.

  • domZ (int, optional) – The number of significant sequences found, for domain E-value calculation. Leave as None to auto-detect by counting the number of sequences reported.

  • F1 (float) – The MSV filter threshold.

  • F2 (float) – The Viterbi filter threshold.

  • F3 (float) – The uncorrected Forward filter threshold.

  • E (float) – The per-target E-value threshold for reporting a hit.

  • T (float, optional) – The per-target bit score threshold for reporting a hit. If given, takes precedence over E.

  • domE (float) – The per-domain E-value threshold for reporting a domain hit.

  • domT (float, optional) – The per-domain bit score threshold for reporting a domain hit. If given, takes precedence over domE.

  • incE (float) – The per-target E-value threshold for including a hit in the resulting TopHits.

  • incT (float, optional) – The per-target bit score threshold for including a hit in the resulting TopHits. If given, takes precedence over incE.

  • incdomE (float) – The per-domain E-value threshold for including a domain in the resulting TopHits.

  • incdomT (float, optional) – The per-domain bit score thresholds for including a domain in the resulting TopHits. If given, takes precedence over incdomE.

  • bit_cutoffs (str, optional) – The model-specific thresholding option to use for reporting hits. With None (the default), use global pipeline options; otherwise pass one of "noise", "gathering" or "trusted" to use the appropriate cutoffs.

Hint

In order to run the pipeline in slow/max mode, disable the bias filter, and set the three filtering parameters to \(1.0\):

>>> dna = easel.Alphabet.dna()
>>> max_pli = Pipeline(dna, bias_filter=False, F1=1.0, F2=1.0, F3=1.0)

Changed in version 0.4.6: Added keywords arguments to set the E-value thresholds.

arguments()

Generate an argv-like list with pipeline options as CLI flags.

Example

>>> alphabet = easel.Alphabet.amino()
>>> plan7.Pipeline(alphabet).arguments()
[]
>>> plan7.Pipeline(alphabet, F1=0.01).arguments()
['--F1', '0.01']

Added in version 0.6.0.

clear()

Reset the pipeline configuration to its default state.

iterate_hmm(query, sequences, builder=None, select_hits=None)

Run the pipeline to find homologous sequences to a query HMM.

Parameters:
  • query (HMM) – The sequence object to use to query the sequence database.

  • sequences (DigitalSequenceBlock) – The target sequences to query with the HMM.

  • builder (Builder, optional) – A HMM builder to use to convert the query and subsequent alignments to a HMM. If None is given, this method will create one with the default parameters.

  • select_hits (callable, optional) – A function or callable object for manually selecting hits during each iteration. It should take a single TopHits argument and change the inclusion of every individual hits by setting the included and dropped flags of each Hit manually.

Returns:

IterativeSearch – An iterator object yielding the hits, sequence alignment, and HMM for each iteration.

Raises:

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query or database sequences.

Hint

This method corresponds to running jackhmmer with the query sequence against the sequences database.

Caution

Default values used for jackhmmer do not correspond to the default parameters used for creating a pipeline in the other cases. To have truly identical results to the jackhmmer results in default mode, create the Pipeline object with incE=0.001 and incdomE=0.001.

See also

The iterate_seq, which does the same operation with a query sequence instead of a query HMM, and contains more details and examples.

Changed in version 0.7.0: Targets must now be inside a DigitalSequenceBlock.

iterate_seq(query, sequences, builder=None, select_hits=None)

Run the pipeline to find homologous sequences to a query sequence.

This method implements an iterative search over a database to find all sequences homologous to a query sequence. It is very sensitive to the pipeline inclusion thresholds (incE and incdomE).

Since this method returns an iterator, the local results of each iteration will be available for inspection before starting the next one. The select_hits callback in particular can be used for manually including/excluding hits in each iteration, which is not supported in the original jackhmmer, but available on the HMMER web client.

Parameters:
  • query (DigitalSequence) – The sequence object to use to query the sequence database.

  • sequences (DigitalSequenceBlock) – The target sequences to query with the query sequence.

  • builder (Builder, optional) – A HMM builder to use to convert the query and subsequent alignments to a HMM. If None is given, this method will create one with the default parameters.

  • select_hits (callable, optional) – A function or callable object for manually selecting hits during each iteration. It should take a single TopHits argument and change the inclusion of individual hits with the Hit.include and Hit.drop methods.

Returns:

IterativeSearch – An iterator object yielding the hits, sequence alignment, and HMM for each iteration.

Raises:

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query or database sequences.

Hint

This method corresponds to running jackhmmer with the query sequence against the sequences database.

Caution

Default values used for jackhmmer do not correspond to the default parameters used for creating a pipeline in the other cases. To have truly identical results to the jackhmmer results in default mode, create the Pipeline object with incE=0.001 and incdomE=0.001.

Example

Starting from a pipeline and a query sequence, let’s first obtain the iterator over the successive results:

>>> abc = easel.Alphabet.amino()
>>> pli = plan7.Pipeline(abc, incE=1e-3, incdomE=1e-3)
>>> iterator = pli.iterate_seq(reductase, proteins)

Once this is ready, we can keep iterating until we converge:

>>> converged = False
>>> while not converged:
...     _, hits, _, converged, _ = next(iterator)
...     print(f"Hits: {len(hits)}  Converged: {converged}")
Hits: 1  Converged: False
Hits: 2  Converged: False
Hits: 2  Converged: True

To prevent diverging searches from running infinitely, you could wrap the search in a for loop instead, using a number of maximum iterations as the upper boundary:

>>> iterator = pli.iterate_seq(reductase, proteins)
>>> max_iterations = 10
>>> for n in range(max_iterations):
...     iteration = next(iterator)
...     if iteration.converged:
...         break

Added in version 0.6.0.

Changed in version 0.7.0: Targets must now be inside a DigitalSequenceBlock.

search_msa(query, sequences, builder=None)

Run the pipeline using a query alignment against a sequence database.

Parameters:
  • query (DigitalMSA) – The multiple sequence alignment to use to query the sequence database.

  • sequences (DigitalSequenceBlock or SequenceFile) – The target sequences to query with the alignment, either pre-loaded in memory inside a pyhmmer.easel.DigitalSequenceBlock, or to be read iteratively from a SequenceFile opened in digital mode.

  • builder (Builder, optional) – A HMM builder to use to convert the query to a HMM. If None is given, it will use a default one.

Returns:

TopHits – the hits found in the sequence database.

Raises:
  • AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.

  • ValueError – When the Builder fails to create an HMM from the given DigitalMSA query.

Hint

This method corresponds to running phmmer with the query multiple sequence alignment against the sequences database.

Caution

Internally, this method will create a new HMM from the query MSA using the Builder.build_msa method. HMMER requires that every HMM has a name, so the Builder will attempt to use the name of the query MSA to name the HMM. Passing an MSA without a name will result in an error.

Added in version 0.3.0.

Changed in version 0.7.0: Targets can be inside a DigitalSequenceBlock or a SequenceFile.

search_seq(query, sequences, builder=None)

Run the pipeline using a query sequence against a sequence database.

Parameters:
  • query (DigitalSequence) – The sequence object to use to query the sequence database.

  • sequences (DigitalSequenceBlock or SequenceFile) – The target sequences to query with the query sequence, either pre-loaded in memory inside a pyhmmer.easel.DigitalSequenceBlock, or to be read iteratively from a SequenceFile opened in digital mode.

  • builder (Builder, optional) – A HMM builder to use to convert the query to a HMM. If None is given, it will use a default one.

Returns:

TopHits – the hits found in the sequence database.

Raises:
  • AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.

  • ValueError – When the Builder fails to create an HMM from the given DigitalSequence query.

Hint

This method corresponds to running phmmer with the query sequence against the sequences database.

Added in version 0.2.0.

Changed in version 0.7.0: Targets can be inside a DigitalSequenceBlock or a SequenceFile.

E

The per-target E-value threshold for reporting a hit.

Added in version 0.4.6.

Type:

float

F1

The MSV filter threshold.

Added in version 0.4.1.

Type:

float

F2

The Viterbi filter threshold.

Added in version 0.4.1.

Type:

float

F3

The uncorrected Forward filter threshold.

Added in version 0.4.1.

Type:

float

T

The per-target score threshold for reporting a hit.

If set to a non-None value, this threshold takes precedence over the per-target E-value threshold (Pipeline.E).

Added in version 0.4.8.

Type:

float or None

Z

The number of effective targets searched.

It is used to compute the independent e-value for each domain, and for an entire hit. If None, the parameter number will be set automatically after all the comparisons have been done. Otherwise, it can be set to an arbitrary number.

Type:

float or None

bias_filter

Whether or not to enable the biased comp HMM filter.

Added in version 0.4.1.

Type:

bool

bit_cutoffs

The model-specific thresholding option, if any.

Added in version 0.4.6.

Type:

str or None

domE

The per-domain E-value threshold for reporting a hit.

Added in version 0.4.6.

Type:

float

domT

The per-domain score threshold for reporting a hit.

If set to a non-None value, this threshold takes precedence over the per-domain E-value threshold (Pipeline.domE).

Added in version 0.4.8.

Type:

float or None

domZ

The number of significant targets.

It is used to compute the conditional e-value for each domain. If None, the parameter number will be set automatically after all the comparisons have been done, and all hits have been thresholded. Otherwise, it can be set to an arbitrary number.

Type:

float or None

incE

The per-target E-value threshold for including a hit.

Added in version 0.4.6.

Type:

float

incT

The per-target score threshold for including a hit.

If set to a non-None value, this threshold takes precedence over the per-target E-value inclusion threshold (Pipeline.incE).

Added in version 0.4.8.

Type:

float or None

incdomE

The per-domain E-value threshold for including a hit.

Added in version 0.4.6.

Type:

float

incdomT

The per-domain score threshold for including a hit.

If set to a non-None value, this threshold takes precedence over the per-domain E-value inclusion threshold (Pipeline.incdomE).

Added in version 0.4.8.

Type:

float or None

null2

Whether or not to enable the null2 score correction.

Added in version 0.4.1.

Type:

bool

scan_seq

Run the pipeline using a query sequence against a profile database.

Parameters:
Returns:

TopHits – the hits found in the profile database.

Raises:

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query or profiles.

Caution

In the current version, this method is not optimized to use the pressed database, even if it exists. This will cause the MSV and SSV filters to be rebuilt at each iteration, which could be slow. Consider at least pre-fetching the HMM database if calling this method several times in a row.

Hint

This method corresponds to running hmmscan with the query sequence against the hmms database.

Added in version 0.4.0.

Changed in version 0.7.0: Require optimized profiles to be inside an OptimizedProfileBlock.

search_hmm

Run the pipeline using a query HMM against a sequence database.

Parameters:
  • query (HMM, Profile or OptimizedProfile) – The object to use to query the sequence database.

  • sequences (DigitalSequenceBlock or SequenceFile) – The target sequences to query with the HMM, either pre-loaded in memory inside a DigitalSequenceBlock, or to be read iteratively from a SequenceFile opened in digital mode.

Returns:

TopHits – the hits found in the sequence database.

Raises:
  • ValueError – When the pipeline is configured to use model-specific reporting thresholds but the HMM query doesn’t have the right cutoffs available.

  • AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given HMM.

Hint

This method corresponds to running hmmsearch with the query HMM against the sequences database.

Added in version 0.2.0.

Changed in version 0.4.9: Query can now be a Profile or an OptimizedProfile.

Changed in version 0.7.0: Targets can be inside a DigitalSequenceBlock or a SequenceFile.

seed

The seed given at pipeline initialization.

Setting this attribute to a different value will cause the random number generator to be reseeded immediately.

Added in version 0.2.0.

Changed in version 0.4.2: Avoid shadowing initial null seeds given on pipeline initialization.

Type:

int

LongTargetsPipeline

class pyhmmer.plan7.LongTargetsPipeline(Pipeline)

An HMMER3 pipeline tuned for long targets.

The default HMMER3 pipeline is configured not to accept target sequences longer than 100,000 residues. Although there is no strong limitation for this threshold, comparing a sequence of \(L\) residues to a profile with \(M\) nodes requires the allocation of a \(L imes M\) dynamic programming matrix.

For sequences too long, it’s actually more efficient memory-wise to use a sliding window to match the profile to the sequence. The usual comparison pipeline is then used to perform the comparison on each window, and results are merged once the entire sequence is done being processed. The context size \(C\) is large enough to accommodate for the entire profile, so that there is no risk of missing a hit in the overlaps between windows. The window size \(W\) can be changed with the block_length argument when instantiating a new LongTargetsPipeline object.

Added in version 0.4.9.

Added in version 0.10.8: The window_length and window_beta keyword arguments.

__init__()

Instantiate and configure a new long targets pipeline.

Parameters:
  • alphabet (Alphabet) – The biological alphabet the of the HMMs and sequences that are going to be compared. Used to build the background model. A nucleotide alphabet is expected.

  • background (Background, optional) – The background model to use with the pipeline, or None to create and use a default one. The pipeline needs ownership of the background model, so any background model passed there will be copied.

Keyword Arguments:
  • strand (str, optional) – The strand to use when processing nucleotide sequences. Use "watson" to use only the coding strand, "crick" to use only the reverse strand, or leave as None to process both strands.

  • B1 (int) – The window length to use for the biased-composition modifier of the MSV filter.

  • B2 (int) – The window length to use for the biased-composition modifier of the Viterbi filter.

  • B3 (int) – The window length to use for the biased-composition modifier of the Forward filter.

  • block_length (int) – The number of residues to use as the window size \(W\) when reading blocks from the long target sequences.

  • window_length (int) – The window length to use to compute E-values.

  • **kwargs – Any additional parameter will be passed to the Pipeline constructor.

arguments()

Generate an argv-like list with pipeline options as CLI flags.

Example

>>> alphabet = easel.Alphabet.dna()
>>> plan7.LongTargetsPipeline(alphabet).arguments()
[]
>>> plan7.LongTargetsPipeline(alphabet, B1=200).arguments()
['--B1', '200']

Added in version 0.6.0.

search_msa(query, sequences, builder=None)

Run the pipeline using a query alignment against a sequence database.

Parameters:
  • query (DigitalMSA) – The multiple sequence alignment to use to query the sequence database.

  • sequences (DigitalSequenceBlock) – The target sequences to query with the query alignment.

  • builder (Builder, optional) – A HMM builder to use to convert the query to a HMM. If None is given, it will use a default one.

Returns:

TopHits – the hits found in the sequence database.

Raises:

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.

Hint

This method corresponds to running nhmmer with the query multiple sequence alignment against the sequences database.

Changed in version 0.7.0: Targets must now be inside a DigitalSequenceBlock.

search_seq(query, sequences, builder=None)

Run the pipeline using a query sequence against a sequence database.

Parameters:
  • query (DigitalSequence) – The sequence object to use to query the sequence database.

  • sequences (DigitalSequenceBlock) – The target sequences to query with the query sequence.

  • builder (Builder, optional) – A HMM builder to use to convert the query to a HMM. If None is given, it will use a default one.

Returns:

TopHits – the hits found in the sequence database.

Raises:

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.

Hint

This method corresponds to running nhmmer with the query sequence against the sequences database.

Changed in version 0.7.0: Targets must now be inside a DigitalSequenceBlock.

B1

The window length for biased-composition modifier of the MSV.

Type:

int

B2

The window length for biased-composition modifier of the Viterbi.

Type:

int

B3

The window length for biased-composition modifier of the Forward.

Type:

int

scan_seq

Run the pipeline using a query sequence against a profile database.

This is currently unsupported for LongTargetsPipeline.

search_hmm

Run the pipeline using a query HMM against a sequence database.

Parameters:
Returns:

TopHits – the hits found in the sequence database.

Raises:
  • ValueError – When the pipeline is configured to use model-specific reporting thresholds but the HMM query doesn’t have the right cutoffs available.

  • AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given HMM.

Hint

This method corresponds to running nhmmer with the query HMM against the sequences database.

Changed in version 0.7.0: Targets must now be inside a DigitalSequenceBlock.

strand

The strand to process, or None for both.

Type:

str or None

window_beta

The tail mass at which window length is determined.

Type:

float

window_length

The window length for nucleotide sequences.

Type:

int or None

Builder

class pyhmmer.plan7.Builder

A factory for constructing new HMMs from raw sequences.

alphabet

The alphabet the builder is configured to use to convert sequences to HMMs.

Type:

Alphabet

randomness

The random number generator being used by the builder.

Type:

Randomness

score_matrix

The name of the substitution matrix used to build HMMs for single sequences.

Type:

str

popen

The gap open probability to use when building HMMs from single sequences.

Type:

float

pextend

The gap extend probability to use when building HMMs from single sequences.

Type:

float

Added in version 0.2.0.

Changed in version 0.4.2: Added the randomness attribute.

__copy__()

None

__init__()

Create a new sequence builder with the given configuration.

Parameters:

alphabet (Alphabet) – The alphabet the builder expects the sequences to be in.

Keyword Arguments:
  • architecture (str) – The algorithm to use to determine the model architecture, either "fast" (the default), or "hand".

  • weighting (str) – The algorithm to use for relative sequence weighting, either "pb" (the default), "gsc", "blosum", "none", or "given".

  • effective_number (str, int, or float) – The algorithm to use to determine the effective sequence number, either "entropy" (the default), "exp", "clust", "none". A number can also be given directly to set the effective sequence number manually.

  • prior_scheme (str) – The choice of mixture Dirichlet prior when parameterizing from counts, either "laplace" or "alphabet" (the default).

  • symfrac (float) – The residue occurrence threshold for fast architecture determination.

  • fragthresh (float) – A threshold such that a sequence is called a fragment when \(L \le fragthresh imes alen\).

  • wid (double) – The percent identity threshold for BLOSUM relative weighting.

  • esigma (float) – The minimum total relative entropy parameter for effective number entropy weights.

  • eid (float) – The percent identity threshold for effective number clustering.

  • EmL (int) – The length of sequences generated for MSV fitting.

  • EmN (int) – The number of sequences generated for MSV fitting.

  • EvL (int) – The lenght of sequences generated for Viterbi fitting.

  • EvN (int) – The number of sequences generated for Viterbi fitting.

  • EfL (int) – The lenght of sequences generated for Forward fitting.

  • EfN (int) – The number of sequences generated for Forward fitting.

  • Eft (float) – The tail mass used for Forward fitting.

  • seed (int) – The seed to use to initialize the internal random number generator. If 0 is given, an arbitrary seed will be chosen based on the current time.

  • ere (double, optional) – The relative entropy target for effective number weighting, or None.

  • popen (float, optional) – The gap open probability to use when building HMMs from single sequences. The default value depends on the alphabet: 0.02 for proteins, 0.03125 for nucleotides.

  • pextend (float, optional) – The gap extend probability to use when building HMMs from single sequences. Default depends on the alphabet: 0.4 for proteins, 0.75 for nucleotides.

  • score_matrix (str, optional) – The name of the score matrix to use when building HMMs from single sequences. The only allowed value for nucleotide alphabets is DNA1. For proteins, PAM30, PAM70, PAM120, PAM240, BLOSUM45, BLOSUM50, BLOSUM62 (the default), BLOSUM80 or BLOSUM90 can be used.

  • window_length (float, optional) – The window length for nucleotide sequences, essentially the max expected hit length. If given, takes precedence over window_beta.

  • window_beta (float, optional) – The tail mass at which window length is determined for nucleotide sequences.

build(sequence, background)

Build a new HMM from sequence using the builder configuration.

Parameters:
  • sequence (DigitalSequence) – A single biological sequence in digital mode to build a HMM with.

  • background (Background) – The background model to use to create the HMM.

Returns:

(HMM, Profile, OptimizedProfile) – A tuple containing the new HMM as well as profiles to be used directly in a Pipeline.

Raises:
  • AlphabetMismatch – When either sequence or background have the wrong alphabet for this builder.

  • ValueError – When the HMM cannot be created successfully from the input; the error message contains more details.

Hint

The score matrix and the gap probabilities used here can be set when initializing the builder, or changed by setting a new value to the right attribute:

>>> alphabet = easel.Alphabet.amino()
>>> background = plan7.Background(alphabet)
>>> builder = plan7.Builder(alphabet, popen=0.05)
>>> builder.score_matrix = "BLOSUM90"
>>> hmm, _, _ = builder.build(proteins[0], background)

Changed in version 0.4.6: Sets the HMM.creation_time attribute with the current time.

Changed in version 0.4.10: The score system is now cached between calls to Builder.build.

build_msa(msa, background)

Build a new HMM from msa using the builder configuration.

Parameters:
  • msa (DigitalMSA) – A multiple sequence alignment in digital mode to build a HMM with.

  • background (Background) – The background model to use to create the HMM.

Returns:

(HMM, Profile, OptimizedProfile) – A tuple containing the new HMM as well as profiles to be used directly in a Pipeline.

Raises:
  • AlphabetMismatch – When either msa or background have the wrong alphabet for this builder.

  • ValueError – When the HMM cannot be created successfully from the input; the error message contains more details.

Caution

HMMER requires that every HMM has a name, so the Builder will attempt to use the name of the sequence to name the HMM. Passing an MSA without a name will result in an error:

>>> alphabet = easel.Alphabet.amino()
>>> msa = easel.TextMSA(sequences=[
...   easel.TextSequence(name=b"ustiA", sequence="YAIG"),
...   easel.TextSequence(name=b"ustiB", sequence="YVIG")
... ])
>>> builder = plan7.Builder(alphabet)
>>> background = plan7.Background(alphabet)
>>> builder.build_msa(msa.digitize(alphabet), background)
Traceback (most recent call last):
  ...
ValueError: Could not build HMM: Unable to name the HMM.

Added in version 0.3.0.

Changed in version 0.4.6: Sets the HMM.creation_time attribute with the current time.

copy()

Create a duplicate Builder instance with the same arguments.

seed

The seed given at builder initialization.

Setting this attribute to a different value will cause the internal random number generator to be reseeded immediately.

Changed in version 0.4.2: Avoid shadowing initial null seeds given on builder initialization.

Type:

int

window_beta

The tail mass at which window length is determined.

Type:

float

window_length

The window length for nucleotide sequences.

Type:

int or None

Background

class pyhmmer.plan7.Background

The null background model of HMMER.

alphabet

The alphabet of the background model.

Type:

Alphabet

uniform

Whether or not the null model has been created with uniform frequencies.

Type:

bool

residue_frequencies

The null1 background residue frequencies.

Type:

VectorF

Changed in version 0.4.0: Added the residue_frequencies attribute.

__copy__()

None

__init__()

Create a new background model for the given alphabet.

Parameters:
  • alphabet (pyhmmer.easel.Alphabet) – The alphabet to create the background model with.

  • uniform (bool) – Whether or not to create the null model with uniform frequencies. Defaults to False.

__reduce__()

None

copy()

Create a copy of the null model with the same parameters.

L

The mean of the null model length distribution, in residues.

Type:

int

omega

The prior on null2/null3.

Added in version 0.4.0.

Type:

float

transition_probability

The null1 transition probability (\(\frac{L}{L+1}\)).

Added in version 0.4.0.

Type:

float

Results

TopHits

class pyhmmer.plan7.TopHits

An immutable ranked list of top-scoring hits.

TopHits are thresholded using the parameters from the pipeline, and are sorted by key when you obtain them from a Pipeline instance:

>>> abc = thioesterase.alphabet
>>> hits = Pipeline(abc).search_hmm(thioesterase, proteins)
>>> hits.is_sorted(by="key")
True

Use len to query the number of top hits, and the usual indexing notation to extract a particular Hit:

>>> len(hits)
1
>>> hits[0].name
b'938293.PRJEB85.HG003687_113'

Added in version 0.6.1: pickle protocol support.

__copy__()

None

__deepcopy__(memo)

None

__getstate__()

None

__init__()

Create an empty TopHits instance.

__reduce__()

None

__setstate__(state)

None

compare_ranking(ranking)

Compare current top hits to previous top hits ranking.

This method is used by jackhmmer to record the hits obtained during each iteration, so that the inner loop can converge.

Parameters:

ranking (KeyHash) – A keyhash containing the ranks of the top hits from a previous run.

Returns:

int – The number of new hits found in this iteration.

Added in version 0.6.0.

copy()

Create a copy of this TopHits instance.

Added in version 0.5.0.

is_sorted(by='key')

Check whether or not the hits are sorted with the given method.

See sort for a list of allowed values for the by argument.

merge(*others)

Concatenate the hits from this instance and others.

If the Z and domZ values used to compute E-values were computed by the Pipeline from the number of targets, the returned object will update them by summing self.Z and other.Z. If they were set manually, the manual value will be kept, provided both values are equal.

Returns:

TopHits – A new collection of hits containing a copy of all the hits from self and other, sorted by key.

Raises:

ValueError – When trying to merge together several hits obtained from different Pipeline with incompatible parameters.

Caution

This should only be done for hits obtained for the same domain on similarly configured pipelines. Some internal checks will be done to ensure this is not the case, but the results may not be consistent at all.

Example

>>> pli = Pipeline(thioesterase.alphabet)
>>> hits1 = pli.search_hmm(thioesterase, proteins[:1000])
>>> hits2 = pli.search_hmm(thioesterase, proteins[1000:2000])
>>> hits3 = pli.search_hmm(thioesterase, proteins[2000:])
>>> merged = hits1.merge(hits2, hits3)

Added in version 0.5.0.

sort(by='key')

Sort hits in the current instance using the given method.

Parameters:

by (str) – The comparison method to use to compare hits. Allowed values are: key (the default) to sort by key, or seqidx to sort by sequence index and alignment position.

to_msa(alphabet, sequences=None, traces=None, trim=False, digitize=False, all_consensus_cols=False)

Create multiple alignment of all included domains.

Parameters:
  • alphabet (Alphabet) – The alphabet of the HMM this TopHits was obtained from. It is required to convert back hits to single sequences.

  • sequences (list of Sequence, optional) – A list of additional sequences to include in the alignment.

  • traces (list of Trace, optional) – A list of additional traces to include in the alignment.

Keyword Arguments:
  • trim (bool) – Trim off any residues that get assigned to flanking \(N\) and \(C\) states (in profile traces) or \(I_0\) and \(I_m\) (in core traces).

  • digitize (bool) – If set to True, returns a DigitalMSA instead of a TextMSA.

  • all_consensus_cols (bool) – Force a column to be created for every consensus column in the model, even if it means having all gap character in a column.

Returns:

MSA – A multiple sequence alignment containing the reported hits, either a TextMSA or a DigitalMSA depending on the value of the digitize argument.

Added in version 0.3.0.

Changed in version 0.6.0: Added the sequences and traces arguments.

write(fh, format='targets', header=True)

Write the hits in tabular format to a file-like object.

Parameters:
  • fh (io.IOBase) – A Python file handle, opened in binary mode.

  • format (str) – The tabular format in which to write the hits.

  • header (bool) – Whether to write a table header. Ignored when writing in the pfam format.

Hint

The hits can be written in one of the following formats:

targets

A tabular output format of per-target hits, as obtained with the --tblout output flag of hmmsearch or hmmscan.

domains

A tabular output format of per-domain hits, as obtained with the --domtblout output flag of hmmsearch or hmmscan.

pfam

A tabular output format suitable for Pfam, merging per-sequence and per-domain hits in a single file, with fewer fields and sorted by score.

Added in version 0.6.1.

E

The per-target E-value threshold for reporting a hit.

Added in version 0.5.0.

Type:

float

T

The per-target score threshold for reporting a hit.

Added in version 0.5.0.

Type:

float or None

Z

The effective number of targets searched.

Type:

float

bit_cutoffs

The model-specific thresholding option, if any.

Added in version 0.5.0.

Type:

str or None

block_length

The block length these hits were obtained with.

Is always None when the hits were not obtained from a long targets pipeline.

Added in version 0.5.0.

Type:

int or None

domE

The per-domain E-value threshold for reporting a hit.

Added in version 0.5.0.

Type:

float

domT

The per-domain score threshold for reporting a hit.

Added in version 0.5.0.

Type:

float or None

domZ

The effective number of significant targets searched.

Type:

float

incE

The per-target E-value threshold for including a hit.

Added in version 0.5.0.

Type:

float

incT

The per-target score threshold for including a hit.

Added in version 0.4.8.

Type:

float or None

incdomE

The per-domain E-value threshold for including a hit.

Added in version 0.5.0.

Type:

float

incdomT

The per-domain score threshold for including a hit.

Added in version 0.5.0.

Type:

float or None

included

An iterator over the hits marked as included.

Added in version 0.7.0.

Type:

iterator of Hit

long_targets

Whether these hits were produced by a long targets pipeline.

Added in version 0.5.0.

Type:

bool

mode

Whether the hits were obtained in scan or search mode.

Added in version 0.9.0.

Type:

str

query_accession

The accession of the query, if any.

Added in version 0.6.1.

Type:

bytes or None

query_length

The length of the query.

Added in version 0.10.5.

Type:

int

query_name

The name of the query, if any.

Added in version 0.6.1.

Type:

bytes or None

reported

An iterator over the hits marked as reported.

Added in version 0.7.0.

Type:

iterator of Hit

searched_models

The number of models searched.

Added in version 0.5.0.

Type:

int

searched_nodes

The number of model nodes searched.

Added in version 0.5.0.

Type:

int

searched_residues

The number of residues searched.

Added in version 0.5.0.

Type:

int

searched_sequences

The number of sequences searched.

Added in version 0.5.0.

Type:

int

strand

The strand these hits were obtained from.

Is always None when the hits were not obtained from a long targets pipeline, or when the long targets pipeline was configured to search both strands.

Added in version 0.5.0.

Type:

str or None

Hit

class pyhmmer.plan7.Hit

A high-scoring database hit found by the comparison pipeline.

A hit is obtained in HMMER for every target where one or more significant domain alignment was found by a Pipeline. A Hit comes with a score, which is obtained after correcting of the individual bit scores of all its domains; a P-value, which is computed by testing the likelihood to obtain the same alignment using a random background model; and an E-value, which is obtained after Bonferonni correction of the p-value, taking into account the total number of targets in the target database.

Hits also store several information as flags. Hit.included and Hit.reported show whether a Hit is considered for inclusion (resp. reporting) with respects to the thresholds defined on the original Pipeline. These flags can be modified manually to force inclusion or exclusion of certains hits independently of their score or E-value. The write method of TopHits objects will only write a line for hits marked as reported. Included hits are necessarily reported:

\[\text{included} \implies \text{reported}\]

When used during an iterative search, hits can also be marked as dropped by setting the Hit.dropped flag to False. Dropped hits will not be used for building HMMs during the next iteration. Hits newly found in an iteration will be marked as new with the Hit.new flag. Hit.dropped and Hit.included are mutually exclusive, and setting one will unset the other. Dropped hits can be reported, but are not included:

\[\text{dropped} \implies \neg \text{included}\]

When running a long target pipeline, some hits may appear as duplicates if they were found across multiple windows. These hits will be marked as duplicates with the Hit.duplicate flag. Duplicate hits are neither reported nor included:

\[\text{duplicate} \implies \neg \text{reported}\]

Added in version 0.6.1: pickle protocol support.

__getstate__()

None

__setstate__(state)

None

accession

The accession of the database hit, if any.

Type:

bytes or None

best_domain

The best scoring domain in this hit.

Added in version 0.4.2.

Type:

Domain

bias

The null2 contribution to the uncorrected score.

Type:

float

description

The description of the database hit, if any.

Type:

bytes or None

domains

The list of domains aligned to this hit.

Type:

Domains

dropped

Whether this hit is marked as dropped.

Added in version 0.7.0.

Type:

bool

duplicate

Whether this hit is marked as duplicate.

Added in version 0.7.0.

Type:

bool

evalue

The e-value of the hit.

Type:

float

included

Whether this hit is marked as included.

Added in version 0.7.0.

Type:

bool

length

The length of the database hit.

Added in version 0.10.5.

Type:

int

name

The name of the database hit.

Type:

bytes

new

Whether this hit is marked as new.

Added in version 0.7.0.

Type:

bool

pre_score

Bit score of the sequence before null2 correction.

Type:

float

pvalue

The p-value of the bitscore.

Added in version 0.4.2.

Type:

float

reported

Whether this hit is marked as reported.

Added in version 0.7.0.

Type:

bool

score

Bit score of the sequence with all domains after correction.

Type:

float

sum_score

Bit score reconstructed from the sum of domain envelopes.

Added in version 0.4.6.

Type:

float

Domains

class pyhmmer.plan7.Domains

A read-only view over the domains of a single Hit.

hit

The target hit these domains belong hit.

Type:

Hit

included

An iterator over included domains only.

Added in version 0.7.0.

Type:

iterator of Domain

reported

An iterator over reported domains only.

Added in version 0.7.0.

Type:

iterator of Domain

Domain

class pyhmmer.plan7.Domain

A single domain in a query Hit.

hit

The hit this domains is part of.

Type:

Hit

alignment

The alignment of this domain to a target sequence.

Type:

Alignment

Added in version 0.6.1: pickle protocol support.

__getstate__()

None

__setstate__(state)

None

bias

The null2 score contribution to the domain score.

Type:

float

c_evalue

The conditional e-value for the domain.

Type:

float

correction

The null2 score when calculating a per-domain score.

Type:

float

env_from

The start coordinate of the domain envelope.

Type:

int

env_to

The end coordinate of the domain envelope.

Type:

int

envelope_score

The forward score in the envelope, without null2 correction.

Type:

float

i_evalue

The independent e-value for the domain.

Type:

float

included

Whether this domain is marked as included.

Added in version 0.7.0.

Type:

bool

pvalue

The p-value of the domain bitscore.

Type:

float

reported

Whether this domain is marked as reported.

Added in version 0.7.0.

Type:

bool

score

The overall score in bits, null2-corrected.

Type:

float

strand

The strand where the domain is located.

When running a search with the LongTargetsPipeline, both strands of each target sequence are processed (unless disabled), so the domain may be located on either strand, either + or -. For default Pipeline searches, this is always None.

Added in version 0.10.8.

Type:

str or None

Alignment

class pyhmmer.plan7.Alignment

An alignment of a sequence to a profile.

domain

The domain this alignment corresponds to.

Type:

Domain

Added in version 0.6.1: pickle protocol support.

__getstate__()

None

__setstate__(state)

None

__sizeof__()

None

hmm_accession

The accession of the query, or its name if it has none.

Added in version 0.1.4.

Type:

bytes

hmm_from

The start coordinate of the alignment in the query HMM.

Type:

int

hmm_length

The length of the query HMM in the alignment.

Added in version 0.10.5.

Type:

int

hmm_name

The name of the query HMM.

Type:

bytes

hmm_sequence

The sequence of the query HMM in the alignment.

Type:

str

hmm_to

The end coordinate of the alignment in the query HMM.

Type:

int

identity_sequence

The identity sequence between the query and the target.

Type:

str

posterior_probabilities

Posterior probability annotation of the alignment.

Added in version 0.10.5.

Type:

str

target_from

The start coordinate of the alignment in the target sequence.

Type:

int

target_length

The length of the target sequence in the alignment.

Added in version 0.10.5.

Type:

int

target_name

The name of the target sequence.

Type:

bytes

target_sequence

The sequence of the target sequence in the alignment.

Type:

str

target_to

The end coordinate of the alignment in the target sequence.

Type:

int

Traces

TraceAligner

class pyhmmer.plan7.TraceAligner

A factory for aligning several sequences to a reference model.

Example

>>> aligner = TraceAligner()
>>> traces = aligner.compute_traces(thioesterase, proteins[:100])
>>> msa = aligner.align_traces(thioesterase, proteins[:100], traces)

Added in version 0.4.7.

align_traces(hmm, sequences, traces, digitize=False, trim=False, all_consensus_cols=False)

Compute traces for a collection of sequences relative to an HMM.

Parameters:
  • hmm (HMM) – The reference HMM to use for the alignment.

  • sequences (DigitalSequenceBlock) – The target sequences to align to the HMM.

  • traces (Traces) – The traces corresponding to the alignment of sequences to hmm, obtained by a previous call to compute_traces.

  • digitize (bool) – If set to True, returns a DigitalMSA instead of a TextMSA.

  • trim (bool) – Trim off any residues that get assigned to flanking \(N\) and \(C\) states (in profile traces) or \(I_0\) and \(I_m\) (in core traces).

  • all_consensus_cols (bool) – Force a column to be created for every consensus column in the model, even if it means having all gap character in a column. Note that this is enabled by default for hmmalign since HMMER v3.1, but disabled here.

Returns:

MSA – A multiple sequence alignment containing the aligned sequences, either a TextMSA or a DigitalMSA depending on the value of the digitize argument.

Raises:
  • AlphabetMismatch – when the alphabet of any of the sequences does not correspond to the HMM alphabet.

  • InvalidHMM – when given a HMM that is not valid.

Changed in version 0.7.0: Targets must now be inside a DigitalSequenceBlock.

compute_traces(hmm, sequences)

Compute traces for a collection of sequences relative to an HMM.

Parameters:
  • hmm (HMM) – The reference HMM to use for the alignment.

  • sequences (DigitalSequenceBlock) – The target sequences to align to the HMM.

Returns:

Traces – The traces for each sequence.

Raises:
  • AlphabetMismatch – when the alphabet of any of the sequences does not correspond to the HMM alphabet.

  • InvalidHMM – when given a HMM that is not valid.

Changed in version 0.7.0: Targets must now be inside a DigitalSequenceBlock.

Traces

class pyhmmer.plan7.Traces

A list of tracebacks obtained by aligning several sequences to a model.

Added in version 0.4.7.

Trace

class pyhmmer.plan7.Trace

A traceback for the alignment of a model to a sequence.

Added in version 0.4.7.

__init__()

Create a new Trace object.

Parameters:

posteriors (bool) – Whether or not to allocate additional memory for the storage of posterior probabilties.

expected_accuracy()

Returns the sum of the posterior residue decoding probabilities.

classmethod from_sequence(sequence)

Create a faux trace from a single sequence.

Added in version 0.6.0.

score(sequence, profile)

Score traceback for target sequence using given profile.

Added in version 0.10.5.

L

The sequence length.

Type:

int

M

The model length.

Type:

int

posterior_probabilities

The posterior probability of each residue.

Type:

VectorF or None

Iterative Searches

IterativeSearch

class pyhmmer.plan7.IterativeSearch

A helper class for running iterative queries like JackHMMER.

See Pipeline.iterate_seq and Pipeline.iterate_hmm for more information.

pipeline

The pipeline object to use to get hits on each iteration.

Type:

Pipeline

builder

The builder object for converting multiple sequence alignments obtained after each run to a HMM.

Type:

Builder

query

The query object to use for the first iteration.

Type:

DigitalSequence or HMM

converged

Whether the iterative search already converged or not.

Type:

bool

targets

The targets sequences to search for homologs.

Type:

DigitalSequenceBlock

ranking

A mapping storing the rank of hits from previous iterations.

Type:

KeyHash

iteration

The index of the last iteration done so far.

Type:

int

Yields:

IterationResult – A named tuple containing the hits, multiple sequence alignment and HMM for each iteration, as well as the iteration index and a flag marking whether the search converged.

References

  • Johnson, Steven L., Eddy, Sean R. & Portugaly, Elon. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 11, 431 (18 August 2010). doi:10.1186/1471-2105-11-431.

__init__(*args, **kwargs)
__reduce_cython__()

None

__setstate_cython__(__pyx_state)

None

_search_hmm(hmm)

None

IterationResult

class pyhmmer.plan7.IterationResult

The results of a single iteration from an IterativeSearch.

hmm

The HMM used to search for hits during this iteration.

Type:

HMM

hits

The hits found during this iteration.

Type:

TopHits

msa

A multiple sequence alignment containing the hits from this iteration.

Type:

DigitalMSA

converged

A flag marking whether this iteration converged (no new hit found in the target sequences with respect to the pipeline inclusion thresholds).

Type:

bool

iteration

The number of iterations done so far. First iteration starts at 1.

Type:

int

Miscellaneous

Cutoffs

class pyhmmer.plan7.Cutoffs

A mutable view over the score cutoffs of a HMM or a Profile.

Added in version 0.4.6.

__copy__()

None

as_vector()

Return a view over the score cutoffs as a VectorF.

gathering_available()

Check whether the gathering thresholds are available.

noise_available()

Check whether the noise cutoffs are available.

trusted_available()

Check whether the trusted cutoffs are available.

gathering

The gathering thresholds, if any.

Example

This property can be used to set the gathering cutoffs by passing it an iterable of two float:

>>> thioesterase.cutoffs.gathering = (180.0, 120.0)
>>> thioesterase.cutoffs.gathering_available()
True
>>> thioesterase.cutoffs.gathering
(180.0, 120.0)

Set the attribute to None or delete it with del to clear the gathering thresholds:

>>> thioesterase.cutoffs.gathering = None
>>> thioesterase.cutoffs.gathering_available()
False

Added in version 0.4.8.

Type:

tuple of float, or None

gathering1

The first gathering threshold, if any.

Type:

float or None

gathering2

The second gathering threshold, if any.

Type:

float or None

noise

The noise cutoffs, if available.

Added in version 0.4.8.

Type:

tuple of float, or None

noise1

The first noise cutoff, if any.

Type:

float or None

noise2

The second noise cutoff, if any.

Type:

float or None

trusted

The trusted cutoffs, if available.

Added in version 0.4.8.

Type:

tuple of float, or None

trusted1

The first trusted score cutoff, if any.

Type:

float or None

trusted2

The second trusted score cutoff, if any.

Type:

float or None

EvalueParameters

class pyhmmer.plan7.EvalueParameters

A mutable view over the e-value parameters of a HMM or a Profile.

The e-value for each filter is estimated based off a maximum likelihood distribution fitted for each profile HMM, either a Gumbel distribution for the MSV and Viterbi filters, or an exponential distribution for the Forward filter.

Added in version 0.4.6.

__copy__()

None

as_vector()

Return a view over the e-value parameters as a VectorF.

f_lambda

The \(\lambda\) parameter for the Forward filter distribution.

Type:

float or None

f_tau

The \(\tau\) parameter for the Forward filter distribution.

Type:

float or None

m_lambda

The \(\lambda\) parameter for the MSV filter distribution.

Type:

float or None

m_mu

The \(\mu\) parameter for the MSV filter distribution.

Type:

float or None

v_lambda

The \(\lambda\) parameter for the Viterbi filter distribution.

Type:

float or None

v_mu

The \(\mu\) parameter for the Viterbi filter distribution.

Type:

float or None

Offsets

class pyhmmer.plan7.Offsets

A mutable view over the disk offsets of a profile.

__copy__()

None

Transitions

class pyhmmer.plan7.Transitions(enum.IntEnum)

A helper enum for indices of the HMM transition probability matrix.

The Plan 7 model architecture used in HMMER describes a HMM which has 3 states and 7 transitions (hence the name) for every node of the model. The HMM can transition from a match state (\(M_n\)) to the next match stage (\(M_{n+1}\)), to an insertion state (\(I_n\)) or a deletion state (\(D_{n+1}\)). However, there are no transitions between a deletion and insertion state.

Added in version 0.8.1.

MM = 0
MI = 1
MD = 2
IM = 3
II = 4
DM = 5
DD = 6