Plan7

High-level interface to the Plan7 data model.

Plan7 is the model architecture used by HMMER since HMMER2.

See also

Details about the Plan 7 architecture in the HMMER documentation.

Hidden Markov Model

HMM

class pyhmmer.plan7.HMM

A data structure storing the Plan7 Hidden Markov Model.

alphabet

The alphabet of the model.

Type

Alphabet

__init__(M, alphabet)

Create a new HMM from scratch.

Parameters
  • M (int) – The length of the model (i.e. the number of nodes).

  • alphabet (Alphabet) – The alphabet of the model.

copy()

Return a copy of the HMM with the exact same configuration.

New in version 0.3.0.

renormalize()

Renormalize all parameter vectors (emissions and transitions).

New in version 0.4.0.

scale(scale, exponential=False)

In a model containing counts, rescale counts by a factor.

This method only affects core probability model emissions and transitions.

Parameters
  • scale (float) – The scaling factor to use (\(1.0\) for no scaling). Often computed using the ratio of effective sequences (\(\frac{n_{eff}}{n_{seq}}\))

  • exponential (bool) – When set to True, use scale as an exponential factor (\(C_i = C_i \exp{s}\)) instead of a multiplicative factor (:math: C_i = C_i times s), resulting in a non-uniform scaling across columns. This can be useful when some heavily fragmented sequences are used to reconstruct a family MSA.

New in version 0.4.0.

set_composition()

Calculate and set the model composition.

New in version 0.4.0.

write(fh, binary=False)

Write the HMM to a file handle.

Parameters
  • fh (io.IOBase) – A Python file handle, opened in binary mode (this must be the case even with binary=False, since the C code will emit bytes in either case).

  • binary (bool) – Pass False to emit the file in ASCII mode using the latest supported HMMER format, or True to use the binary HMMER3 format.

zero()

Set all parameters to zero, including model composition.

M

The length of the model (i.e. the number of nodes).

Type

int

accession

The accession of the HMM, if any.

Type

bytes or None

checksum

The 32-bit checksum of the HMM, if any.

The checksum if calculated from the alignment the HMM was created from, and was introduced in more recent HMM formats. This means some HMM objects may have a non-None checksum.

New in version 0.2.1.

Changed in version 0.3.1: Returns None if the HMM flag for the checksum is not set.

Type

int or None

command_line

The command line that built the model.

For HMMs created with Builder, this defaults to sys.argv. It can however be set to any string, including multiline to show successive commands.

Example

>>> print(thioesterase.command_line)
hmmbuild Thioesterase.hmm Thioesterase.fa
hmmcalibrate Thioesterase.hmm

New in version 0.3.1.

Type

str or None

composition

The model composition.

May not be available for freshly-created HMMs. To get the mean residue composition emitted by the model, the set_composition method must be called to compute the composition from occupancy.

Note

Although the allocated buffer in the P7_HMM object is always the same dimension so that it can store the largest possible alphabet, we only expose the relevant residues. This means that the vector will be of size alphabet.K:

>>> dna = easel.Alphabet.dna()  # dna.K=4
>>> hmm = plan7.HMM(100, dna)
>>> hmm.set_composition()
>>> len(hmm.composition)
4

New in version 0.4.0.

Type

VectorF or None

consensus

The consensus residue line of the HMM, if set.

New in version 0.3.0.

Type

str or None

consensus_accessibility

The consensus accessibility of the HMM, if any.

New in version 0.3.1.

Type

str or None

consensus_structure

The consensus structure of the HMM, if any.

New in version 0.3.1.

Type

str or None

description

The description of the HMM, if any.

Type

bytes or None

insert_emissions

The insert emissions of the model.

The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.

Caution

If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1.

New in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF, and stores row 0.

Type

memoryview of float

match_emissions

The match emissions of the model.

The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.

Note

Since the first row corresponds to the entry probabilities, the emissions are unused. By convention, it should still contain valid probabilities, so it will always be set as follow with 1 probability for the first symbol, and 0 for the rest:

>>> hmm = HMM(100, alphabet=easel.Alphabet.dna())
>>> hmm.match_emissions[0]
VectorF([1.0, 0.0, 0.0, 0.0])

Caution

If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1.

New in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF, and stores row 0.

Type

MatrixF

model_mask

The model mask line from the alignment, if any.

New in version 0.3.1.

Type

str or None

name

The name of the HMM, if any.

Type

bytes or None

nseq

The number of training sequences used, if any.

If the HMM was created from a multiple sequence alignment, this corresponds to the number of sequences in the MSA.

Example

>>> thioesterase.nseq
278

New in version 0.3.1.

Type

int or None

nseq_effective

The number of effective sequences used, if any.

New in version 0.3.1.

Type

float or None

reference

The reference line from the alignment, if any.

This is relevant if the HMM was built from a multiple sequence alignment (e.g. by Builder.build_msa, or by an external hmmbuild pipeline run).

New in version 0.3.1.

Type

str or None

transition_probabilities

The transition probabilities of the model.

The property exposes a matrix of shape \((M+1, 7)\), with one row per node (plus one extra row for the entry probabilities), and one column per transition.

Columns indices correspond to the following: 0 for \(M_n \to M_{n+1}\), 1 for \(M_n \to I_{n+1}\), 2 for \(M_n \to M_{n+1}\), 3 for \(I_n \to M_{n+1}\), 4 for \(I_n \to I_{n+1}\), 5 for \(D_n \to D_{n+1}\), 6 for \(D_n \to D_{n+1}\).

Example

>>> t = thioesterase.transition_probabilities
>>> t[0, 5]  # TDM, 1 by convention
1.0

Caution

If editing this matrix manually, note that some invariants need to hold for the HMM to be valid: \(I_n\), \(M_n\) and \(D_n\) transition probabilities should only contain probabilities between 0 and 1, and sum to 1:

>>> t = thioesterase.transition_probabilities
>>> t[50, 0] + t[50, 1] + t[50, 2]  # M_n probabilities
1.000...
>>> t[50, 3] + t[50, 4]  # I_n probabilities
1.000...
>>> t[50, 5] + t[50, 6]  # D_n probabilties
1.000...

New in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF.

Type

MatrixF

HMM File

class pyhmmer.plan7.HMMFile

A wrapper around a file (or database), storing serialized HMMs.

__init__(file, db=True)

Create a new HMM reader from the given file.

Parameters
  • file (str or file-like object) – Either the path to a file containing the HMMs to read, or a file-like object opened in binary-mode.

  • db (bool) – Set to False to force the parser to ignore the pressed HMM database if it finds one. Defaults to False.

close()

Close the HMM file and free resources.

This method has no effect if the file is already closed. It is called automatically if the HMMFile was used in a context:

>>> with HMMFile("tests/data/hmms/bin/PKSI-AT.h3m") as hmm_file:
...     hmm = next(hmm_file)

Profile

Profile

class pyhmmer.plan7.Profile

A Plan7 search profile.

__init__(M, alphabet)

Create a new profile for the given alphabet.

Parameters
  • M (int) – The length of the profile, i.e. the number of nodes.

  • alphabet (Alphabet) – The alphabet to use with this profile.

clear()

Clear internal buffers to reuse the profile without reallocation.

configure(hmm, background, L, multihit=True, local=True)

Configure a search profile using the given models.

Parameters
  • hmm (pyhmmer.plan7.HMM) – The model HMM with core probabilities.

  • bg (pyhmmer.plan7.Background) – The null background model.

  • L (int) – The expected target sequence length.

  • multihit (bool) – Whether or not to use multihit modes.

  • local (bool) – Whether or not to use non-local modes.

copy()

Return a copy of the profile with the exact same configuration.

is_local()

Return whether or not the profile is in a local alignment mode.

is_multihit()

Returns whether or not the profile is in a multihit alignment mode.

optimized()

Convert the profile to a platform-specific optimized profile.

Returns

OptimizedProfile – The platform-specific optimized profile built using the configuration of this profile.

L

The current configured target sequence length.

Type

int

M

The length of the profile (i.e. the number of nodes).

New in version 0.3.0.

Type

int

accession

The accession of the profile, if any.

New in version 0.3.0.

Type

bytes or None

consensus

The consensus residue line of the profile, if set.

New in version 0.4.1.

Type

str or None

consensus_structure

The consensus structure of the profile, if any.

New in version 0.4.1.

Type

str or None

description

The description of the profile, if any.

New in version 0.3.0.

Type

bytes or None

name

The name of the profile, if any.

New in version 0.3.0.

Type

bytes or None

OptimizedProfile

class pyhmmer.plan7.OptimizedProfile

An optimized profile that uses platform-specific instructions.

__init__(M, alphabet)

Create a new optimized profile from scratch.

Optimized profiles use platform-specific code to accelerate the various algorithms. Although you can allocate an optimized profile yourself, the only way to obtain a fully configured profile is to create it with the Profile.optimized method, after having configured the profile for a given HMM with Profile.configure.

Parameters
  • M (int) – The length of the model (i.e. the number of nodes).

  • alphabet (Alphabet) – The alphabet of the model.

copy()

Create an exact copy of the optimized profile.

is_local()

Return whether or not the profile is in a local alignment mode.

ssv_filter()

ssv_filter(self, seq)n–

Compute the SSV filter score for the given sequence.

Parameters

seq (DigitalSequence) –

Returns

float or None – The SSV filter score for the sequence.

Note

  • math.inf may be returned if an overflow occurs that will also occur in the MSV filter. This is the case whenever \(\text{base} - \text{tjb} - \text{tbm} \ge 128\)

  • None if the MSV filter score needs to be recomputed (because it may not overflow even though the SSV filter did).

Raises

AlphabetMismatch – when the alphabet of the sequence does not correspond to the profile alphabet.

Caution

This method is not available on the PowerPC platform (calling it will raise a NotImplementedError).

New in version v0.4.0.

write(fh_filter, fh_profile)

Write an optimized profile to two separate files.

HMMER implements an acceleration pipeline using several scoring algorithms. Parameters for MSV (the Multi ungapped Segment Viterbi) are saved independently to the fh_filter handle, while the rest of the profile is saved to fh_profile.

L

The currently configured target sequence length.

New in version v0.4.0.

Type

int

M

The number of nodes in the model.

New in version v0.4.0.

Type

int

bias

The positive bias to emission scores.

New in version v0.4.0.

Type

int

rbv

The match scores for the MSV filter.

Type

MatrixU8

sbv

The match scores for the SSV filter.

New in version v0.4.0.

Type

MatrixU8

tbm

The constant cost for a \(B \to M_k\) transition.

New in version v0.4.0.

Type

int

tec

The constant cost for a \(E \to C\) transition.

New in version v0.4.0.

Type

int

tjb

The constant cost for a \(NJC\) move.

New in version v0.4.0.

Type

int

Background

class pyhmmer.plan7.Background

The null background model of HMMER.

alphabet

The alphabet of the backgound model.

Type

Alphabet

uniform

Whether or not the null model has been created with uniform frequencies.

Type

bool

residue_frequencies

The null1 background residue frequencies.

Type

VectorF

Changed in version 0.4.0: Added the residue_frequencies attribute.

__init__(alphabet, uniform=False)

Create a new background model for the given alphabet.

Parameters
  • alphabet (pyhmmer.easel.Alphabet) – The alphabet to create the background model with.

  • uniform (bool) – Whether or not to create the null model with uniform frequencies. Defaults to False.

copy()

Create a copy of the null model with the same parameters.

L

The mean of the null model length distribution, in residues.

Type

int

omega

The prior on null2/null3.

New in version 0.4.0.

Type

float

transition_probability

The null1 transition probability (\(\frac{L}{L+1}\)).

New in version 0.4.0.

Type

float

Pipelines

Pipeline

class pyhmmer.plan7.Pipeline

An HMMER3 accelerated sequence/profile comparison pipeline.

alphabet

The alphabet for which the pipeline is configured.

Type

Alphabet

background

The null background model to use to compute scores.

Type

Background

randomness

The random number generator being used by the pipeline.

Type

Randomness

Changed in version 0.4.2: Added the randomness attribute.

__init__(alphabet, background=None, *, bias_filter=True, null2=True, seed=42, Z=None, domZ=None)

Instantiate and configure a new accelerated comparison pipeline.

Parameters
  • alphabet (Alphabet) – The biological alphabet the of the HMMs and sequences that are going to be compared. Used to build the background model.

  • background (Background, optional) – The background model to use with the pipeline, or None to create and use a default one. The pipeline needs ownership of the background model, so any background model passed there will be copied.

Keyword Arguments
  • bias_filter (bool) – Whether or not to enable composition bias filter. Defaults to True.

  • null2 (bool) – Whether or not to compute biased composition score corrections. Defaults to True.

  • seed (int, optional) – The seed to use with the random number generator. Pass 0 to use a one-time arbitrary seed, or None to keep the default seed from HMMER.

  • Z (int, optional) – The effective number of comparisons done, for E-value calculation. Leave as None to auto-detect by counting the number of sequences queried.

  • domZ (int, optional) – The number of significant sequences found, for domain E-value calculation. Leave as None to auto-detect by counting the number of sequences reported.

  • F1 (float) – The MSV filter threshold.

  • F2 (float) – The Viterbi filter threshold.

  • F3 (float) – The uncorrected Forward filter threshold.

clear()

Reset the pipeline configuration to its default state.

scan_seq(query, hmms)

Run the pipeline using a query sequence against a profile database.

Parameters
  • query (DigitalSequence) – The sequence object to use to query the profile database.

  • hmms (iterable of DigitalSequence) – The HMM profiles to query. Pass a HMMFile instance to read from disk iteratively.

Returns

TopHits – the hits found in the profile database.

Raises

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query or profile.

Caution

In the current version, this method is not optimized to use the pressed database, even if it exists. This will cause the MSV and SSV filters to be rebuilt at each iteration, which could be slow. Consider at least pre-fetching the HMM database if calling this method several times in a row.

New in version v0.4.0.

search_hmm(query, sequences)

Run the pipeline using a query HMM against a sequence database.

Parameters
  • query (HMM) – The HMM object to use to query the sequence database.

  • sequences (iterable of DigitalSequence) – The sequences to query with the HMM. For instance, pass a SequenceFile in digital mode to read from disk iteratively.

Returns

TopHits – the hits found in the sequence database.

Raises

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given HMM.

New in version 0.2.0.

search_msa(query, sequences, builder=None)

Run the pipeline using a query alignment against a sequence database.

Parameters
  • query (DigitalMSA) – The multiple sequence alignment to use to query the sequence database.

  • sequences (iterable of DigitalSequence) – The sequences to query. Pass a SequencesFile instance in digital mode to read from disk iteratively.

  • builder (Builder, optional) – A HMM builder to use to convert the query to a HMM. If None is given, it will use a default one.

Returns

TopHits – the hits found in the sequence database.

Raises

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.

New in version 0.3.0.

search_seq(query, sequences, builder=None)

Run the pipeline using a query sequence against a sequence database.

Parameters
  • query (DigitalSequence) – The sequence object to use to query the sequence database.

  • sequences (iterable of DigitalSequence) – The sequences to query. Pass a SequenceFile instance in digital mode to read from disk iteratively.

  • builder (Builder, optional) – A HMM builder to use to convert the query to a HMM. If None is given, it will use a default one.

Returns

TopHits – the hits found in the sequence database.

Raises

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.

New in version 0.2.0.

F1

The MSV filter threshold.

New in version 0.4.1.

Type

float

F2

The Viterbi filter threshold.

New in version 0.4.1.

Type

float

F3

The uncorrected Forward filter threshold.

New in version 0.4.1.

Type

float

Z

The number of effective targets searched.

It is used to compute the independent e-value for each domain, and for an entire hit. If None, the parameter number will be set automatically after all the comparisons have been done. Otherwise, it can be set to an arbitrary number.

Type

float or None

bias_filter

Whether or not to enable the biased comp HMM filter.

New in version 0.4.1.

Type

bool

domZ

The number of significant targets.

It is used to compute the conditional e-value for each domain. If None, the parameter number will be set automatically after all the comparisons have been done, and all hits have been thresholded. Otherwise, it can be set to an arbitrary number.

Type

float or None

null2

Whether or not to enable the null2 score correction.

New in version 0.4.1.

Type

bool

seed

The seed given at pipeline initialization.

Setting this attribute to a different value will cause the random number generator to be reseeded immediately.

New in version 0.2.0.

Changed in version 0.4.2: Avoid shadowing initial null seeds given on pipeline initialization.

Type

int

Builder

class pyhmmer.plan7.Builder

A factory for constructing new HMMs from raw sequences.

alphabet

The alphabet the builder is configured to use to convert sequences to HMMs.

Type

Alphabet

randomness

The random number generator being used by the builder.

Type

Randomness

New in version 0.2.0.

Changed in version 0.4.2: Added the randomness attribute.

__init__(alphabet, *, architecture='fast', weighting='pb', effective_number='entropy', prior_scheme='alpha', symfrac=0.5, fragthresh=0.5, wid=0.62, esigma=45.0, eid=0.62, EmL=200, EmN=200, EvL=200, EvN=200, EfL=100, EfN=200, Eft=0.04, seed=42, ere=None, popen=None, pextend=None)

Create a new sequence builder with the given configuration.

Parameters

alphabet (Alphabet) – The alphabet the builder expects the sequences to be in.

Keyword Arguments
  • popen (float) – The gap open probability to use with the score system. Default depends on the alphabet: 0.02 for proteins, 0.03125 for nucleotides.

  • pextend (float) – The gap extend probability to use with the score system. Default depends on the alphabet: 0.4 for proteins, 0.75 for nucleotides.

  • seed (int) – The seed to use to initialize the internal random number generator. If 0 is given, an arbitrary seed will be chosen based on the current time.

build(sequence, background)

Build a new HMM from sequence using the builder configuration.

Parameters
  • sequence (DigitalSequence) – A single biological sequence in digital mode to build a HMM with.

  • background (pyhmmer.plan7.background) – The background model to use to create the HMM.

Raises

AlphabetMismatch – When either sequence or background have the wrong alphabet for this builder.

build_msa(msa, background)

Build a new HMM from msa using the builder configuration.

Parameters
  • msa (DigitalMSA) – A multiple sequence alignment in digital mode to build a HMM with.

  • background (pyhmmer.plan7.background) – The background model to use to create the HMM.

Raises

AlphabetMismatch – When either msa or background have the wrong alphabet for this builder.

New in version 0.3.0.

copy()

Create a duplicate Builder instance with the same arguments.

seed

The seed given at builder initialization.

Setting this attribute to a different value will cause the internal random number generator to be reseeded immediately.

Changed in version 0.4.2: Avoid shadowing initial null seeds given on builder initialization.

Type

int

Results

TopHits

class pyhmmer.plan7.TopHits

A ranked list of top-scoring hits.

TopHits are thresholded using the parameters from the pipeline, and are sorted by key when you obtain them from a Pipeline instance:

>>> abc = thioesterase.alphabet
>>> hits = Pipeline(abc).search_hmm(thioesterase, proteins)
>>> hits.is_sorted()
True

Use len to query the number of top hits, and the usual indexing notation to extract a particular Hit:

>>> len(hits)
1
>>> hits[0].name
b'938293.PRJEB85.HG003687_113'
__init__()

Create an empty TopHits instance.

is_sorted(by='key')

Check whether or not the hits are sorted with the given method.

See sort for a list of allowed values for the by argument.

sort(by='key')

Sort hits in the current instance using the given method.

Parameters

by (str) – The comparison method to use to compare hits. Allowed values are: key (the default) to sort by key, or seqidx to sort by sequence index and alignment position.

threshold(pipeline)

Apply score and e-value thresholds using pipeline parameters.

This function is automatically called in Pipeline.search_hmm or Pipeline.search_seq, so it should have limited usefulness at the Python level.

to_msa(alphabet, trim=False, digitize=False, all_consensus_cols=False)

Create multiple alignment of all included domains.

Parameters
  • alphabet (Alphabet) – The alphabet of the HMM this TopHits was obtained from. It is required to convert back hits to single sequences.

  • trim (bool) – Trim off any residues that get assigned to flanking \(N\) and \(C\) states (in profile traces) or \(I_0\) and \(I_m\) (in core traces).

  • digitize (bool) – If set to True, returns a DigitalMSA instead of a TextMSA.

  • all_consensus_cols (bool) – Force a column to be created for every consensus column in the model, even if it means having all gap character in a column.

Returns

MSA – A multiple sequence alignment containing the reported hits, either a TextMSA or a DigitalMSA depending on the value of the digitize argument.

New in version 0.3.0.

included

The number of hits that are above the inclusion threshold.

Type

int

reported

The number of hits that are above the reporting threshold.

Type

int

Hit

class pyhmmer.plan7.Hit

A high-scoring database hit found by the comparison pipeline.

accession

The accession of the database hit, if any.

Type

bytes or None

best_domain

The best scoring domain in this hit.

New in version 0.4.2.

Type

Domain

bias

The null2 contribution to the uncorrected score.

Type

float

description

The description of the database hit, if any.

Type

bytes or None

domains

The list of domains aligned to this hit.

Type

Domains

evalue

The e-value of the hit.

Type

float

name

The name of the database hit.

Type

bytes

pre_score

Bit score of the sequence before null2 correction.

Type

float

pvalue

The p-value of the bitscore.

New in version 0.4.2.

Type

float

score

Bit score of the sequence with all domains after correction.

Type

float

Domains

class pyhmmer.plan7.Domains

A read-only view over the domains of a single Hit.

hit

The target hit these domains belong hit.

Type

Hit

Domain

class pyhmmer.plan7.Domain

A single domain in a query Hit.

hit

The hit this domains is part of.

Type

Hit

alignment

The alignment of this domain to a target sequence.

Type

Alignment

bias

The null2 score contribution to the domain score.

Type

float

c_evalue

The conditional e-value for the domain.

Type

float

correction

The null2 score when calculating a per-domain score.

Type

float

envelope_score

The forward score in the envelope, without null2 correction.

Type

float

i_evalue

The independent e-value for the domain.

Type

float

pvalue

The p-value of the domain bitscore.

Type

float

score

The overall score in bits, null2-corrected.

Type

float

Alignment

class pyhmmer.plan7.Alignment

An alignment of a sequence to a profile.

domain

The domain this alignment corresponds to.

Type

Domain

hmm_accession

The accession of the query, or its name if it has none.

New in version 0.1.4.

Type

bytes

hmm_from

The start coordinate of the alignment in the query HMM.

Type

int

hmm_name

The name of the query HMM.

Type

bytes

hmm_sequence

The sequence of the query HMM in the alignment.

Type

str

hmm_to

The end coordinate of the alignment in the query HMM.

Type

int

identity_sequence

The identity sequence between the query and the target.

Type

str

target_from

The start coordinate of the alignment in the target sequence.

Type

int

target_name

The name of the target sequence.

Type

bytes

target_sequence

The sequence of the target sequence in the alignment.

Type

str

target_to

The end coordinate of the alignment in the target sequence.

Type

int