Plan7¶

High-level interface to the Plan7 data model.

Plan7 is the model architecture used by HMMER since HMMER2.

Hidden Markov Model¶

HMM¶

class pyhmmer.plan7.HMM¶

A data structure storing the Plan7 Hidden Markov Model.

alphabet¶

The alphabet of the model.

Type: Alphabet

Changed in version 0.4.6: Added the evalue_parameters and cutoffs attributes.

__init__(M, alphabet)¶

Create a new HMM from scratch.

Parameters

M (int) – The length of the model (i.e. the number of nodes).
alphabet (Alphabet) – The alphabet of the model.

copy()¶: Return a copy of the HMM with the exact same configuration.

New in version 0.3.0.

match_occupancy()¶

Calculate the match occupancy for each match state.

Returns: VectorF – A vector of size \(M+1\) containing the probability that each match state is used in a sample glocal path through the model.

New in version 0.4.10.

mean_match_entropy()¶

Compute the mean entropy per HMM match state.

The mean entropy per match state emission distribution is defined as:

\[- \frac{1}{M} \sum_{k=1}^{M} \sum_x p_k(x) \log_2 p_k(x)\]

where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\).

Returns: float – The mean entropy, in bits.

Example

>>> thioesterase.mean_match_entropy()
3.0425...

New in version 0.4.10.

mean_match_information(background)¶

Compute the mean information content of the HMM match states.

The mean information content per match state emission distribution is defined as:

\[\frac{1}{M} \sum_{k=1}^{M} \left[ \sum_x f(x) \log_2 f(x) - \sum_x p_k(x) \log_2 p_k(x) \right]\]

in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.

Parameters: background (Background) – The null model from which to get the background emission probabilities.
Returns: float – The mean information content, in bits.

Example

>>> background = plan7.Background(easel.Alphabet.amino())
>>> thioesterase.mean_match_information(background)
1.1330...

New in version 0.4.10.

mean_match_relative_entropy(background)¶

Compute the mean relative entropy per HMM match state.

The mean relative entropy per match state emission distribution is defined as:

\[\frac{1}{M} \sum_{k=1}^{M} \sum_x { p_k(x) \log_2 \frac{p_k(x)}{f(x)} }\]

in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.

Parameters: background (Background) – The null model from which to get the background emission probabilities.
Returns: float – The mean relative entropy, in bits.

Example

>>> background = plan7.Background(easel.Alphabet.amino())
>>> thioesterase.mean_match_relative_entropy(background)
1.1201...

New in version 0.4.10.

renormalize()¶: Renormalize all parameter vectors (emissions and transitions).

New in version 0.4.0.

scale(scale, exponential=False)¶

Rescale counts by a factor in a model containing counts.

This method only affects core probability model emissions and transitions.

Parameters

scale (float) – The scaling factor to use (\(1.0\) for no scaling). Often computed using the ratio of effective sequences (\(\frac{n_{eff}}{n_{seq}}\))
exponential (bool) – When set to True, use scale as an exponential factor (\(C_i = C_i^s\)) instead of a multiplicative factor (\(C_i = C_i \times s\)), resulting in a non-uniform scaling across columns. This can be useful when some heavily fragmented sequences are used to reconstruct a family MSA.

New in version 0.4.0.

set_composition()¶: Calculate and set the model composition.

New in version 0.4.0.

write(fh, binary=False)¶

Write the HMM to a file handle.

Parameters

fh (io.IOBase) – A Python file handle, opened in binary mode (this must be the case even with binary=False, since the C code will emit bytes in either case).
binary (bool) – Pass False to emit the file in ASCII mode using the latest supported HMMER format, or True to use the binary HMMER3 format.

zero()¶: Set all parameters to zero, including model composition.

M¶

The length of the model (i.e. the number of nodes).

Type: int

accession¶

The accession of the HMM, if any.

Type: bytes or None

checksum¶

The 32-bit checksum of the HMM, if any.

The checksum if calculated from the alignment the HMM was created from, and was introduced in more recent HMM formats. This means some HMM objects may have a non-None checksum.

New in version 0.2.1.

Changed in version 0.3.1: Returns None if the HMM flag for the checksum is not set.

Type: int or None

command_line¶

The command line that built the model.

For HMMs created with Builder, this defaults to sys.argv. It can however be set to any string, including multiline to show successive commands.

Example

>>> print(thioesterase.command_line)
hmmbuild Thioesterase.hmm Thioesterase.fa
hmmcalibrate Thioesterase.hmm

New in version 0.3.1.

Type: str or None

composition¶

The model composition.

May not be available for freshly-created HMMs. To get the mean residue composition emitted by the model, the set_composition method must be called to compute the composition from occupancy.

Note

Although the allocated buffer in the P7_HMM object is always the same dimension so that it can store the largest possible alphabet, we only expose the relevant residues. This means that the vector will be of size alphabet.K:

>>> dna = easel.Alphabet.dna()  # dna.K=4
>>> hmm = plan7.HMM(100, dna)
>>> hmm.set_composition()
>>> len(hmm.composition)
4

New in version 0.4.0.

Type: VectorF or None

consensus¶

The consensus residue line of the HMM, if set.

New in version 0.3.0.

Type: str or None

consensus_accessibility¶

The consensus accessibility of the HMM, if any.

New in version 0.3.1.

Type: str or None

consensus_structure¶

The consensus structure of the HMM, if any.

New in version 0.3.1.

Type: str or None

creation_time¶

The creation time of the HMM, if any.

Example

Get the creation time for any HMM:

>>> thioesterase.creation_time
datetime.datetime(2008, 11, 25, 17, 28, 32)

Set the creation time manually to a different date and time:

>>> ctime = datetime.datetime(2021, 8, 23, 23, 59, 19)
>>> thioesterase.creation_time = ctime
>>> thioesterase.creation_time
datetime.datetime(2021, 8, 23, 23, 59, 19)

Danger

Internally, libhmmer always uses asctime to generate a timestamp for the HMMs, so this property assumes that every creation time field can be parsed into a datetime.datetime object using the "%a %b %d %H:%M:%S %Y" format.

New in version 0.4.6.

Type: datetime.datetime or None

cutoffs¶

The bitscore cutoffs for this HMM.

Type: Cutoffs

description¶

The description of the HMM, if any.

Type: bytes or None

evalue_parameters¶

The e-value parameters for this HMM.

Type: EvalueParameters

insert_emissions¶

The insert emissions of the model.

The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.

Caution

If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1.

New in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF, and stores row 0.

Type: memoryview of float

match_emissions¶

The match emissions of the model.

The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.

Note

Since the first row corresponds to the entry probabilities, the emissions are unused. By convention, it should still contain valid probabilities, so it will always be set as follow with 1 probability for the first symbol, and 0 for the rest:

>>> hmm = HMM(100, alphabet=easel.Alphabet.dna())
>>> hmm.match_emissions[0]
VectorF([1.0, 0.0, 0.0, 0.0])

Caution

If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1.

New in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF, and stores row 0.

Type: MatrixF

model_mask¶

The model mask line from the alignment, if any.

New in version 0.3.1.

Type: str or None

name¶

The name of the HMM, if any.

Type: bytes or None

nseq¶

The number of training sequences used, if any.

If the HMM was created from a multiple sequence alignment, this corresponds to the number of sequences in the MSA.

Example

>>> thioesterase.nseq
278

New in version 0.3.1.

Type: int or None

nseq_effective¶

The number of effective sequences used, if any.

New in version 0.3.1.

Type: float or None

reference¶

The reference line from the alignment, if any.

This is relevant if the HMM was built from a multiple sequence alignment (e.g. by Builder.build_msa, or by an external hmmbuild pipeline run).

New in version 0.3.1.

Type: str or None

transition_probabilities¶

The transition probabilities of the model.

The property exposes a matrix of shape \((M+1, 7)\), with one row per node (plus one extra row for the entry probabilities), and one column per transition.

Columns indices correspond to the following transitions weights:

0 for \(M_n \to M_{n+1}\)
1 for \(M_n \to I_{n+1}\)
2 for \(M_n \to D_{n+1}\)
3 for \(I_n \to M_{n+1}\)
4 for \(I_n \to I_{n+1}\)
5 for \(D_n \to M_{n+1}\)
6 for \(D_n \to D_{n+1}\)

Example

>>> t = thioesterase.transition_probabilities
>>> t[0, 5]  # TDM, 1 by convention
1.0

Caution

If editing this matrix manually, note that some invariants need to hold for the HMM to be valid: \(I_n\), \(M_n\) and \(D_n\) transition probabilities should only contain probabilities between 0 and 1, and sum to 1:

>>> t = thioesterase.transition_probabilities
>>> t[50, 0] + t[50, 1] + t[50, 2]  # M_n probabilities
1.000...
>>> t[50, 3] + t[50, 4]  # I_n probabilities
1.000...
>>> t[50, 5] + t[50, 6]  # D_n probabilties
1.000...

New in version 0.3.1.

Changed in version 0.4.0: This property is now a MatrixF.

Type: MatrixF

HMM File¶

class pyhmmer.plan7.HMMFile¶

A wrapper around a file (or database), storing serialized HMMs.

__init__(file, db=True)¶

Create a new HMM reader from the given file.

Parameters

file (str or file-like object) – Either the path to a file containing the HMMs to read, or a file-like object opened in binary mode.
db (bool) – Set to False to force the parser to ignore the pressed HMM database if it finds one. Defaults to True.

close()¶

Close the HMM file and free resources.

This method has no effect if the file is already closed. It is called automatically if the HMMFile was used in a context:

>>> with HMMFile("tests/data/hmms/bin/PKSI-AT.h3m") as hmm_file:
...     hmm = hmm_file.read()

is_pressed()¶

Check whether the HMM file is a pressed HMM database.

A pressed database is an HMMER format to store optimized profiles in several files on the disk. It can be used to reduce the time needed to process sequences by cutting down the time needed to convert from an HMM to an OptimizedProfile.

Example

>>> HMMFile("tests/data/hmms/txt/PKSI-AT.hmm").is_pressed()
False
>>> HMMFile("tests/data/hmms/bin/PKSI-AT.h3m").is_pressed()
False
>>> HMMFile("tests/data/hmms/db/PKSI-AT.hmm").is_pressed()
True

New in version 0.4.11.

optimized_profiles()¶: Get an iterator over the `OptimizedProfile`s in the HMM database.

New in version 0.4.11.

read()¶

Read the next HMM from the file.

Returns

HMM – The next HMM in the file, or None if all HMMs were read from the file already.

Raises

ValueError – When attempting to read a HMM from a closed file, or when the file could not be parsed.
AllocationError – When memory for the HMM could not be allocated successfully.

New in version 0.4.11.

closed¶

Whether the HMMFile is closed or not.

Type: bool

Profile¶

class pyhmmer.plan7.Profile¶

A Plan7 search profile.

alphabet¶

The alphabet for which this profile was configured.

Type: Alphabet

Changed in version 0.4.6: Added the evalue_parameters and cutoffs attributes.

__init__(M, alphabet)¶

Create a new profile for the given alphabet.

Parameters

M (int) – The length of the profile, i.e. the number of nodes.
alphabet (Alphabet) – The alphabet to use with this profile.

clear()¶: Clear internal buffers to reuse the profile without reallocation.

configure(hmm, background, L, multihit=True, local=True)¶

Configure a search profile using the given models.

Parameters

hmm (pyhmmer.plan7.HMM) – The model HMM with core probabilities.
bg (pyhmmer.plan7.Background) – The null background model.
L (int) – The expected target sequence length.
multihit (bool) – Whether or not to use multihit modes.
local (bool) – Whether or not to use non-local modes.

copy()¶: Return a copy of the profile with the exact same configuration.

is_local()¶: Return whether or not the profile is in a local alignment mode.

is_multihit()¶: Returns whether or not the profile is in a multihit alignment mode.

optimized()¶

Convert the profile to a platform-specific optimized profile.

Returns: OptimizedProfile – The platform-specific optimized profile built using the configuration of this profile.

L¶

The current configured target sequence length.

Type: int

M¶

The length of the profile (i.e. the number of nodes).

New in version 0.3.0.

Type: int

accession¶

The accession of the profile, if any.

New in version 0.3.0.

Type: bytes or None

consensus¶

The consensus residue line of the profile, if set.

New in version 0.4.1.

Type: str or None

consensus_structure¶

The consensus structure of the profile, if any.

New in version 0.4.1.

Type: str or None

cutoffs¶

The bitscore cutoffs for this profile, if any.

Type: Cutoffs

description¶

The description of the profile, if any.

New in version 0.3.0.

Type: bytes or None

evalue_parameters¶

The e-value parameters for this profile.

Type: EvalueParameters

name¶

The name of the profile, if any.

New in version 0.3.0.

Type: bytes or None

offsets¶

The disk offsets for this profile.

Type: Offsets

OptimizedProfile¶

class pyhmmer.plan7.OptimizedProfile¶

An optimized profile that uses platform-specific instructions.

alphabet¶

The alphabet for which this optimized profile is configured.

Type: Alphabet

__init__(M, alphabet)¶

Create a new optimized profile from scratch.

Optimized profiles use platform-specific code to accelerate the various algorithms. Although you can allocate an optimized profile yourself, the only way to obtain a fully configured profile is to create it with the Profile.optimized method, after having configured the profile for a given HMM with Profile.configure.

Parameters

M (int) – The length of the model (i.e. the number of nodes).
alphabet (Alphabet) – The alphabet of the model.

convert(profile)¶

Store the given profile into self as a platform-specific profile.

Use this method to obtained an optimized profile from a Profile while recycling the internal vector buffers.

Background¶

class pyhmmer.plan7.Background¶

The null background model of HMMER.

alphabet¶

The alphabet of the backgound model.

Type: Alphabet

uniform¶

Whether or not the null model has been created with uniform frequencies.

Type: bool

residue_frequencies¶

The null1 background residue frequencies.

Type: VectorF

Changed in version 0.4.0: Added the residue_frequencies attribute.

__init__(alphabet, uniform=False)¶

Create a new background model for the given alphabet.

Parameters

alphabet (pyhmmer.easel.Alphabet) – The alphabet to create the background model with.
uniform (bool) – Whether or not to create the null model with uniform frequencies. Defaults to False.

copy()¶: Create a copy of the null model with the same parameters.

L¶

The mean of the null model length distribution, in residues.

Type: int

omega¶

The prior on null2/null3.

New in version 0.4.0.

Type: float

transition_probability¶

The null1 transition probability (\(\frac{L}{L+1}\)).

New in version 0.4.0.

Type: float

Pipelines¶

Pipeline¶

class pyhmmer.plan7.Pipeline¶

An HMMER3 accelerated sequence/profile comparison pipeline.

alphabet¶

The alphabet for which the pipeline is configured.

Type: Alphabet

background¶

The null background model to use to compute scores.

Type: Background

randomness¶

The random number generator being used by the pipeline.

Type: Randomness

Changed in version 0.4.2: Added the randomness attribute.

__init__(alphabet, background=None, *, bias_filter=True, null2=True, seed=42, Z=None, domZ=None, F1=0.02, F2=0.001, F3=1e-05, E=10.0, T=None, domE=10.0, domT=None, incE=0.01, incT=None, incdomE=0.01, incdomT=None, bit_cutoffs=None)¶

Instantiate and configure a new accelerated comparison pipeline.

Parameters

alphabet (Alphabet) – The biological alphabet the of the HMMs and sequences that are going to be compared. Used to build the background model.
background (Background, optional) – The background model to use with the pipeline, or None to create and use a default one. The pipeline needs ownership of the background model, so any background model passed there will be copied.

Keyword Arguments

bias_filter (bool) – Whether or not to enable composition bias filter. Defaults to True.
null2 (bool) – Whether or not to compute biased composition score corrections. Defaults to True.
seed (int, optional) – The seed to use with the random number generator. Pass 0 to use a one-time arbitrary seed, or None to keep the default seed from HMMER.
Z (int, optional) – The effective number of comparisons done, for E-value calculation. Leave as None to auto-detect by counting the number of sequences queried.
domZ (int, optional) – The number of significant sequences found, for domain E-value calculation. Leave as None to auto-detect by counting the number of sequences reported.
F1 (float) – The MSV filter threshold.
F2 (float) – The Viterbi filter threshold.
F3 (float) – The uncorrected Forward filter threshold.
E (float) – The per-target E-value threshold for reporting a hit.
T (float, optional) – The per-target bit score threshold for reporting a hit. If given, takes precedence over E.
domE (float) – The per-domain E-value threshold for reporting a domain hit.
domT (float, optional) – The per-domain bit score threshold for reporint a domain hit. If given, takes precedence over domE.
incE (float) – The per-target E-value threshold for including a hit in the resulting TopHits.
incT (float, optional) – The per-target bit score threshold for including a hit in the resulting TopHits. If given, takes precedence over incE.
incdomE (float) – The per-domain E-value threshold for including a domain in the resulting TopHits.
incdomT (float, optional) – The per-domain bit score thresholds for including a domain in the resulting TopHits. If given, takes precedence over incdomE.
bit_cutoffs (str, optional) – The model-specific thresholding option to use for reporting hits. With None (the default), use global pipeline options; otherwise pass one of "noise", "gathering" or "trusted" to use the appropriate cutoffs.

Hint

In order to run the pipeline in slow/max mode, disable the bias filter, and set the three filtering parameters to \(1.0\):

>>> dna = easel.Alphabet.dna()
>>> max_pli = Pipeline(dna, bias_filter=False, F1=1.0, F2=1.0, F3=1.0)

Changed in version 0.4.6: Added keywords arguments to set the E-value thresholds.

clear()¶: Reset the pipeline configuration to its default state.

scan_seq(query, hmms)¶

Run the pipeline using a query sequence against a profile database.

Parameters

query (DigitalSequence) – The sequence object to use to query the profile database.
hmms (iterable of DigitalSequence) – The HMM profiles to query. Pass a HMMFile instance to read from disk iteratively.

Returns

TopHits – the hits found in the profile database.

Raises

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query or profile.

Caution

In the current version, this method is not optimized to use the pressed database, even if it exists. This will cause the MSV and SSV filters to be rebuilt at each iteration, which could be slow. Consider at least pre-fetching the HMM database if calling this method several times in a row.

New in version v0.4.0.

search_hmm(query, sequences)¶

Run the pipeline using a query HMM against a sequence database.

Parameters

query (HMM, Profile or OptimizedProfile) – The object to use to query the sequence database.
sequences (collection of DigitalSequence) – The sequences to query with the HMM.

Returns

TopHits – the hits found in the sequence database.

Raises

ValueError – When the pipeline is configured to use model-specific reporting thresholds but the HMM query doesn’t have the right cutoffs available.
AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given HMM.

New in version 0.2.0.

Changed in version 0.4.9: Query can now be a Profile or an OptimizedProfile.

search_msa(query, sequences, builder=None)¶

Run the pipeline using a query alignment against a sequence database.

Parameters

query (DigitalMSA) – The multiple sequence alignment to use to query the sequence database.
sequences (collection of DigitalSequence) – The sequences to query.
builder (Builder, optional) – A HMM builder to use to convert the query to a HMM. If None is given, it will use a default one.

Returns

TopHits – the hits found in the sequence database.

Raises

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.

New in version 0.3.0.

search_seq(query, sequences, builder=None)¶

Run the pipeline using a query sequence against a sequence database.

Parameters

query (DigitalSequence) – The sequence object to use to query the sequence database.
sequences (collection of DigitalSequence) – The sequences to query.
builder (Builder, optional) – A HMM builder to use to convert the query to a HMM. If None is given, it will use a default one.

Returns

TopHits – the hits found in the sequence database.

Raises

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.

New in version 0.2.0.

E¶

The per-target E-value threshold for reporting a hit.

New in version 0.4.6.

Type: float

F1¶

The MSV filter threshold.

New in version 0.4.1.

Type: float

F2¶

The Viterbi filter threshold.

New in version 0.4.1.

Type: float

F3¶

The uncorrected Forward filter threshold.

New in version 0.4.1.

Type: float

T¶

The per-target score threshold for reporting a hit.

If set to a non-None value, this threshold takes precedence over the per-target E-value threshold (Pipeline.E).

New in version 0.4.8.

Type: float or None

Z¶

The number of effective targets searched.

It is used to compute the independent e-value for each domain, and for an entire hit. If None, the parameter number will be set automatically after all the comparisons have been done. Otherwise, it can be set to an arbitrary number.

Type: float or None

bias_filter¶

Whether or not to enable the biased comp HMM filter.

New in version 0.4.1.

Type: bool

bit_cutoffs¶

The model-specific thresholding option, if any.

New in version 0.4.6.

Type: str or None

domE¶

The per-domain E-value threshold for reporting a hit.

New in version 0.4.6.

Type: float

domT¶

The per-domain score threshold for reporting a hit.

If set to a non-None value, this threshold takes precedence over the per-domain E-value threshold (Pipeline.domE).

New in version 0.4.8.

Type: float or None

domZ¶

The number of significant targets.

It is used to compute the conditional e-value for each domain. If None, the parameter number will be set automatically after all the comparisons have been done, and all hits have been thresholded. Otherwise, it can be set to an arbitrary number.

Type: float or None

incE¶

The per-target E-value threshold for including a hit.

New in version 0.4.6.

Type: float

incT¶

The per-target score threshold for including a hit.

If set to a non-None value, this threshold takes precedence over the per-target E-value inclusion threshold (Pipeline.incE).

New in version 0.4.8.

Type: float or None

incdomE¶

The per-domain E-value threshold for including a hit.

New in version 0.4.6.

Type: float

incdomT¶

The per-domain score threshold for including a hit.

If set to a non-None value, this threshold takes precedence over the per-domain E-value inclusion threshold (Pipeline.incdomE).

New in version 0.4.8.

Type: float or None

null2¶

Whether or not to enable the null2 score correction.

New in version 0.4.1.

Type: bool

seed¶

The seed given at pipeline initialization.

Setting this attribute to a different value will cause the random number generator to be reseeded immediately.

New in version 0.2.0.

Changed in version 0.4.2: Avoid shadowing initial null seeds given on pipeline initialization.

Type: int

LongTargetsPipeline¶

class pyhmmer.plan7.LongTargetsPipeline(Pipeline)¶

An HMMER3 pipeline tuned for long targets.

New in version 0.4.9.

__init__(alphabet, background=None, *, F1=0.02, F2=0.003, F3=3e-05, strand=None, B1=100, B2=240, B3=1000, block_length=262144, **kwargs)¶

Instantiate and configure a new long targets pipeline.

Parameters

alphabet (Alphabet) – The biological alphabet the of the HMMs and sequences that are going to be compared. Used to build the background model. A nucleotide alphabet is expected.
background (Background, optional) – The background model to use with the pipeline, or None to create and use a default one. The pipeline needs ownership of the background model, so any background model passed there will be copied.

Keyword Arguments

strand (str, optional) – The strand to use when processing nucleotide sequences. Use "watson" to use only the coding strand, "crick" to use only the reverse strand, or leave as None to process both strands.
B1 (int) – The window length to use for the biased-composition modifier of the MSV filter.
B2 (int) – The window length to use for the biased-composition modifier of the Viterbi filter.
B3 (int) – The window length to use for the biased-composition modifier of the Forward filter.
block_length (int) – The number of residues to use as the window size when reading blocks from target sequences.
**kwargs – Any additional parameter will be passed to the Pipeline constructor.

scan_seq()¶

Run the pipeline using a query sequence against a profile database.

Parameters

query (DigitalSequence) – The sequence object to use to query the profile database.
hmms (iterable of DigitalSequence) – The HMM profiles to query. Pass a HMMFile instance to read from disk iteratively.

Returns

TopHits – the hits found in the profile database.

Raises

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query or profile.

Caution

New in version v0.4.0.

search_hmm()¶

Run the pipeline using a query HMM against a sequence database.

Parameters

query (HMM, Profile or OptimizedProfile) – The object to use to query the sequence database.
sequences (collection of DigitalSequence) – The sequences to query with the HMM.

Returns

TopHits – the hits found in the sequence database.

Raises

ValueError – When the pipeline is configured to use model-specific reporting thresholds but the HMM query doesn’t have the right cutoffs available.
AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given HMM.

New in version 0.2.0.

Changed in version 0.4.9: Query can now be a Profile or an OptimizedProfile.

search_msa()¶

Run the pipeline using a query alignment against a sequence database.

Parameters

query (DigitalMSA) – The multiple sequence alignment to use to query the sequence database.
sequences (collection of DigitalSequence) – The sequences to query.
builder (Builder, optional) – A HMM builder to use to convert the query to a HMM. If None is given, it will use a default one.

Returns

TopHits – the hits found in the sequence database.

Raises

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.

New in version 0.3.0.

search_seq()¶

Run the pipeline using a query sequence against a sequence database.

Parameters

query (DigitalSequence) – The sequence object to use to query the sequence database.
sequences (collection of DigitalSequence) – The sequences to query.
builder (Builder, optional) – A HMM builder to use to convert the query to a HMM. If None is given, it will use a default one.

Returns

TopHits – the hits found in the sequence database.

Raises

AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.

New in version 0.2.0.

B1¶

The window length for biased-composition modifier of the MSV.

Type: int

B2¶

The window length for biased-composition modifier of the Viterbi.

Type: int

B3¶

The window length for biased-composition modifier of the Forward.

Type: int

strand¶

The strand to process, or None for both.

Type: str or None

Builder¶

class pyhmmer.plan7.Builder¶

A factory for constructing new HMMs from raw sequences.

alphabet¶

The alphabet the builder is configured to use to convert sequences to HMMs.

Type: Alphabet

randomness¶

The random number generator being used by the builder.

Type: Randomness

score_matrix¶

The name of the substitution matrix used to build HMMs for single sequences.

Type: str

popen¶

The gap open probability to use when building HMMs from single sequences.

Type: float

pextend¶

The gap extend probability to use when building HMMs from single sequences.

Type: float

New in version 0.2.0.

Changed in version 0.4.2: Added the randomness attribute.

__init__(alphabet, *, architecture='fast', weighting='pb', effective_number='entropy', prior_scheme='alpha', symfrac=0.5, fragthresh=0.5, wid=0.62, esigma=45.0, eid=0.62, EmL=200, EmN=200, EvL=200, EvN=200, EfL=100, EfN=200, Eft=0.04, seed=42, ere=None, popen=None, pextend=None, window_length=None, window_beta=None)¶

Create a new sequence builder with the given configuration.

Parameters

alphabet (Alphabet) – The alphabet the builder expects the sequences to be in.

Keyword Arguments

architecture (str) – The algorithm to use to determine the model architecture, either "fast" (the default), or "hand".
weighting (str) – The algorithm to use for relative sequence weighting, either "pb" (the default), "gsc", "blosum", "none", or "given".
effective_number (str, int, or float) – The algorithm to use to determine the effective sequence number, either "entropy" (the default), "exp", "clust", "none". A number can also be given directly to set the effective sequence number manually.
prior_scheme (str) – The choice of mixture Dirichlet prior when parameterizing from counts, either "laplace" or "alphabet" (the default).
symfrac (float) – The residue occurrence threshold for fast architecture determination.
fragthresh (float) – A threshold such that a sequence is called a fragment when \(L \le fragthresh imes alen\).
wid (double) – The percent identity threshold for BLOSUM relative weighting.
esigma (float) – The minimum total relative entropy parameter for effective number entropy weights.
eid (float) – The percent identity threshold for effective number clustering.
EmL (int) – The length of sequences generated for MSV fitting.
EmN (int) – The number of sequences generated for MSV fitting.
EvL (int) – The lenght of sequences generated for Viterbi fitting.
EvN (int) – The number of sequences generated for Viterbi fitting.
EfL (int) – The lenght of sequences generated for Forward fitting.
EfN (int) – The number of sequences generated for Forward fitting.
Eft (float) – The tail mass used for Forward fitting.
seed (int) – The seed to use to initialize the internal random number generator. If 0 is given, an arbitrary seed will be chosen based on the current time.
ere (double, optional) – The relative entropy target for effective number weighting, or None.
popen (float, optional) – The gap open probability to use when building HMMs from single sequences. The default value depends on the alphabet: 0.02 for proteins, 0.03125 for nucleotides.
pextend (float, optional) – The gap extend probability to use when building HMMs from single sequences. Default depends on the alphabet: 0.4 for proteins, 0.75 for nucleotides.
score_matrix (str, optional) – The name of the score matrix to use when building HMMs from single sequences. The only allowed value for nucleotide alphabets is DNA1. For proteins, PAM30, PAM70, PAM120, PAM240, BLOSUM45, BLOSUM50, BLOSUM62 (the default), BLOSUM80 or BLOSUM90 can be used.
window_length (float, optional) – The window length for nucleotide sequences, essentially the max expected hit length. If given, takes precedence over window_beta.
window_beta (float, optional) – The tail mass at which window length is determined for nucleotide sequences.

build(sequence, background)¶

Build a new HMM from sequence using the builder configuration.

Parameters

sequence (DigitalSequence) – A single biological sequence in digital mode to build a HMM with.
background (pyhmmer.plan7.background) – The background model to use to create the HMM.

Returns

(HMM, Profile, OptimizedProfile) – A tuple containing the new HMM as well as profiles to be used directly in a Pipeline.

Raises

AlphabetMismatch – When either sequence or background have the wrong alphabet for this builder.

Hint

The score matrix and the gap probabilities used here can be set when initializing the builder, or changed by setting a new value to the right attribute:

>>> alphabet = easel.Alphabet.amino()
>>> background = plan7.Background(alphabet)
>>> builder = plan7.Builder(alphabet, popen=0.05)
>>> builder.score_matrix = "BLOSUM90"
>>> hmm, _, _ = builder.build(proteins[0], background)

Changed in version 0.4.6: Sets the HMM.creation_time attribute with the current time.

Changed in version 0.4.10: The score system is now cached between calls to Builder.build.

build_msa(msa, background)¶

Build a new HMM from msa using the builder configuration.

Parameters

msa (DigitalMSA) – A multiple sequence alignment in digital mode to build a HMM with.
background (pyhmmer.plan7.background) – The background model to use to create the HMM.

Returns

(HMM, Profile, OptimizedProfile) – A tuple containing the new HMM as well as profiles to be used directly in a Pipeline.

Raises

AlphabetMismatch – When either msa or background have the wrong alphabet for this builder.

New in version 0.3.0.

Changed in version 0.4.6: Sets the HMM.creation_time attribute with the current time.

copy()¶: Create a duplicate Builder instance with the same arguments.

seed¶

The seed given at builder initialization.

Setting this attribute to a different value will cause the internal random number generator to be reseeded immediately.

Changed in version 0.4.2: Avoid shadowing initial null seeds given on builder initialization.

Type: int

window_beta¶

The tail mass at which window length is determined.

Type: float

window_length¶

The window length for nucleotide sequences.

Type: int or None

Results¶

TopHits¶

class pyhmmer.plan7.TopHits¶

A ranked list of top-scoring hits.

TopHits are thresholded using the parameters from the pipeline, and are sorted by key when you obtain them from a Pipeline instance:

>>> abc = thioesterase.alphabet
>>> hits = Pipeline(abc).search_hmm(thioesterase, proteins)
>>> hits.is_sorted()
True

Use len to query the number of top hits, and the usual indexing notation to extract a particular Hit:

>>> len(hits)
1
>>> hits[0].name
b'938293.PRJEB85.HG003687_113'

__init__()¶: Create an empty TopHits instance.

is_sorted(by='key')¶

Check whether or not the hits are sorted with the given method.

See sort for a list of allowed values for the by argument.

sort(by='key')¶

Sort hits in the current instance using the given method.

Parameters: by (str) – The comparison method to use to compare hits. Allowed values are: key (the default) to sort by key, or seqidx to sort by sequence index and alignment position.

threshold(pipeline)¶

Apply score and e-value thresholds using pipeline parameters.

This function is automatically called in Pipeline.search_hmm or Pipeline.search_seq, so it should have limited usefulness at the Python level.

to_msa(alphabet, trim=False, digitize=False, all_consensus_cols=False)¶

Create multiple alignment of all included domains.

Parameters

alphabet (Alphabet) – The alphabet of the HMM this TopHits was obtained from. It is required to convert back hits to single sequences.
trim (bool) – Trim off any residues that get assigned to flanking \(N\) and \(C\) states (in profile traces) or \(I_0\) and \(I_m\) (in core traces).
digitize (bool) – If set to True, returns a DigitalMSA instead of a TextMSA.
all_consensus_cols (bool) – Force a column to be created for every consensus column in the model, even if it means having all gap character in a column.

Returns

MSA – A multiple sequence alignment containing the reported hits, either a TextMSA or a DigitalMSA depending on the value of the digitize argument.

New in version 0.3.0.

included¶

The number of hits that are above the inclusion threshold.

Type: int

reported¶

The number of hits that are above the reporting threshold.

Type: int

Hit¶

class pyhmmer.plan7.Hit¶

A high-scoring database hit found by the comparison pipeline.

is_dropped()¶: Check if the hit was dropped.

is_duplicate()¶: Check if the hit is a duplicate.

is_included()¶: Check if the hit should be included with respect to the thresholds.

is_new()¶: Check if the hit is a new hit.

is_reported()¶: Check if the hit should be reported with respect to the thresholds.

accession¶

The accession of the database hit, if any.

Type: bytes or None

best_domain¶

The best scoring domain in this hit.

New in version 0.4.2.

Type: Domain

bias¶

The null2 contribution to the uncorrected score.

Type: float

description¶

The description of the database hit, if any.

Type: bytes or None

domains¶

The list of domains aligned to this hit.

Type: Domains

evalue¶

The e-value of the hit.

Type: float

name¶

The name of the database hit.

Type: bytes

pre_score¶

Bit score of the sequence before null2 correction.

Type: float

pvalue¶

The p-value of the bitscore.

New in version 0.4.2.

Type: float

score¶

Bit score of the sequence with all domains after correction.

Type: float

sum_score¶

Bit score reconstructed from the sum of domain envelopes.

New in version 0.4.6.

Type: float

Domains¶

class pyhmmer.plan7.Domains¶

A read-only view over the domains of a single Hit.

hit¶

The target hit these domains belong hit.

Type: Hit

Domain¶

class pyhmmer.plan7.Domain¶

A single domain in a query Hit.

hit¶

The hit this domains is part of.

Type: Hit

alignment¶

The alignment of this domain to a target sequence.

Type: Alignment

bias¶

The null2 score contribution to the domain score.

Type: float

c_evalue¶

The conditional e-value for the domain.

Type: float

correction¶

The null2 score when calculating a per-domain score.

Type: float

env_from¶

The start coordinate of the domain envelope.

Type: int

env_to¶

The end coordinate of the domain envelope.

Type: int

envelope_score¶

The forward score in the envelope, without null2 correction.

Type: float

i_evalue¶

The independent e-value for the domain.

Type: float

pvalue¶

The p-value of the domain bitscore.

Type: float

score¶

The overall score in bits, null2-corrected.

Type: float

Alignment¶

class pyhmmer.plan7.Alignment¶

An alignment of a sequence to a profile.

domain¶

The domain this alignment corresponds to.

Type: Domain

hmm_accession¶

The accession of the query, or its name if it has none.

New in version 0.1.4.

Type: bytes

hmm_from¶

The start coordinate of the alignment in the query HMM.

Type: int

hmm_name¶

The name of the query HMM.

Type: bytes

hmm_sequence¶

The sequence of the query HMM in the alignment.

Type: str

hmm_to¶

The end coordinate of the alignment in the query HMM.

Type: int

identity_sequence¶

The identity sequence between the query and the target.

Type: str

target_from¶

The start coordinate of the alignment in the target sequence.

Type: int

target_name¶

The name of the target sequence.

Type: bytes

target_sequence¶

The sequence of the target sequence in the alignment.

Type: str

target_to¶

The end coordinate of the alignment in the target sequence.

Type: int

Traces¶

TraceAligner¶

class pyhmmer.plan7.TraceAligner¶

A factory for aligning several sequences to a reference model.

Example

>>> aligner = TraceAligner()
>>> traces = aligner.compute_traces(thioesterase, proteins[:100])
>>> msa = aligner.align_traces(thioesterase, proteins[:100], traces)

New in version 0.4.7.

align_traces(hmm, sequences, traces, trim=False, digitize=False, all_consensus_cols=False)¶

Compute traces for a collection of sequences relative to an HMM.

Parameters

hmm (HMM) – The reference HMM to use for the alignment.
sequences (collection of DigitalSequence) – The sequences to align to the HMM.
traces (Traces) – The traces corresponding to the alignment of sequences to hmm, obtained by a previous call to compute_traces.
trim (bool) – Trim off any residues that get assigned to flanking \(N\) and \(C\) states (in profile traces) or \(I_0\) and \(I_m\) (in core traces).
digitize (bool) – If set to True, returns a DigitalMSA instead of a TextMSA.
all_consensus_cols (bool) – Force a column to be created for every consensus column in the model, even if it means having all gap character in a column. Note that this is enabled by default for hmmalign since HMMER v3.1, but disabled here.

Returns

MSA – A multiple sequence alignment containing the aligned sequences, either a TextMSA or a DigitalMSA depending on the value of the digitize argument.

Raises

AlphabetMismatch – when the alphabet of any of the sequences does not correspond to the HMM alphabet.

compute_traces(hmm, sequences)¶

Compute traces for a collection of sequences relative to an HMM.

Parameters

hmm (HMM) – The reference HMM to use for the alignment.
sequences (collection of DigitalSequence) – The sequences to align to the HMM.

Returns

Traces – The traces for each sequence.

Raises

AlphabetMismatch – when the alphabet of any of the sequences does not correspond to the HMM alphabet.

Traces¶

class pyhmmer.plan7.Traces¶: A list of tracebacks obtained by aligning several sequences to a model.

New in version 0.4.7.

Trace¶

class pyhmmer.plan7.Trace¶

A traceback for the alignment of a model to a sequence.

New in version 0.4.7.

expected_accuracy()¶: Returns the sum of the posterior residue decoding probabilities.

L¶

The sequence length.

Type: int

M¶

The model length.

Type: int

posterior_probabilities¶

The posterior probability of each residue.

Type: VectorF

Miscellaneous¶

Cutoffs¶

class pyhmmer.plan7.Cutoffs¶

A mutable view over the score cutoffs of a HMM or a Profile.

New in version 0.4.6.

as_vector()¶: Return a view over the score cutoffs as a VectorF.

gathering_available()¶: Check whether the gathering thresholds are available.

noise_available()¶: Check whether the noise cutoffs are available.

trusted_available()¶: Check whether the trusted cutoffs are available.

gathering¶

The gathering thresholds, if any.

Example

This property can be used to set the gathering cutoffs by passing it an iterable of two float:

>>> thioesterase.cutoffs.gathering = (180.0, 120.0)
>>> thioesterase.cutoffs.gathering_available()
True
>>> thioesterase.cutoffs.gathering
(180.0, 120.0)

Set the attribute to None or delete it with del to clear the gathering thresholds:

>>> thioesterase.cutoffs.gathering = None
>>> thioesterase.cutoffs.gathering_available()
False

New in version 0.4.8.

Type: tuple of float, or None

gathering1¶

The first gathering threshold, if any.

Type: float or None

gathering2¶

The second gathering threshold, if any.

Type: float or None

noise¶

The noise cutoffs, if available.

New in version 0.4.8.

Type: tuple of float, or None

noise1¶

The first noise cutoff, if any.

Type: float or None

noise2¶

The second noise cutoff, if any.

Type: float or None

trusted¶

The trusted cutoffs, if available.

New in version 0.4.8.

Type: tuple of float, or None

trusted1¶

The first trusted score cutoff, if any.

Type: float or None

trusted2¶

The second trusted score cutoff, if any.

Type: float or None

EvalueParameters¶

class pyhmmer.plan7.EvalueParameters¶

A mutable view over the e-value parameters of a HMM or a Profile.

The e-value for each filter is estimated based off a maximum likelihood distribution fitted for each profile HMM, either a Gumbel distribution for the MSV and Viterbi filters, or an exponential distribution for the Forward filter.

New in version 0.4.6.