Plan7¶
High-level interface to the Plan7 data model.
Plan7 is the model architecture used by HMMER since HMMER2.
See also
Details about the Plan 7 architecture in the HMMER documentation.
Profile HMMs¶
HMM¶
- class pyhmmer.plan7.HMM¶
A data structure storing the Plan7 Hidden Markov Model.
Changed in version 0.4.6: Added the
evalue_parameters
andcutoffs
attributes.- __init__(alphabet, M, name)¶
Create a new HMM from scratch.
- copy()¶
Return a copy of the HMM with the exact same configuration.
New in version 0.3.0.
- match_occupancy()¶
Calculate the match occupancy for each match state.
- Returns:
VectorF
– A vector of size \(M+1\) containing the probability that each match state is used in a sample glocal path through the model.
New in version 0.4.10.
- mean_match_entropy()¶
Compute the mean entropy per HMM match state.
The mean entropy per match state emission distribution is defined as:
\[- \frac{1}{M} \sum_{k=1}^{M} \sum_x p_k(x) \log_2 p_k(x)\]where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\).
- Returns:
float
– The mean entropy, in bits.
Example
>>> thioesterase.mean_match_entropy() 3.0425...
New in version 0.4.10.
- mean_match_information(background)¶
Compute the mean information content of the HMM match states.
The mean information content per match state emission distribution is defined as:
\[\frac{1}{M} \sum_{k=1}^{M} \left[ \sum_x f(x) \log_2 f(x) - \sum_x p_k(x) \log_2 p_k(x) \right]\]in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.
- Parameters:
background (
Background
) – The null model from which to get the background emission probabilities.- Returns:
float
– The mean information content, in bits.
Example
>>> background = plan7.Background(easel.Alphabet.amino()) >>> thioesterase.mean_match_information(background) 1.1330...
New in version 0.4.10.
- mean_match_relative_entropy(background)¶
Compute the mean relative entropy per HMM match state.
The mean relative entropy per match state emission distribution is defined as:
\[\frac{1}{M} \sum_{k=1}^{M} \sum_x { p_k(x) \log_2 \frac{p_k(x)}{f(x)} }\]in bits, where \(p_k(x)\) is the emission probability for symbol \(x\) from match state \(k\), and \(f(x)\) is the background emission probability for \(x\) from the null model.
- Parameters:
background (
Background
) – The null model from which to get the background emission probabilities.- Returns:
float
– The mean relative entropy, in bits.
Example
>>> background = plan7.Background(easel.Alphabet.amino()) >>> thioesterase.mean_match_relative_entropy(background) 1.1201...
New in version 0.4.10.
- renormalize()¶
Renormalize all parameter vectors (emissions and transitions).
New in version 0.4.0.
- sample(M, alphabet, randomness)¶
Sample an HMM of length
M
at random.- Parameters:
alphabet (
Alphabet
) – The alphabet of the model.M (
int
) – The length of the model to generate (i.e. the number of nodes).randomness (
Randomness
) – The random number generator to use for sampling.ungapped (
bool
) – Set toTrue
to build an ungapped HMM, i.e. an HMM where the \(M_n o M_{n+1}\) are all one and the remaining transitions are zero. Ignored whenenumerable
isTrue
.enumerable (
bool
) – Set toTrue
to build a random HMM with no nonzero insertion transitions.
- Returns:
HMM
– A new HMM generated at random.
New in version 0.7.0.
- scale(scale, exponential=False)¶
Rescale counts by a factor in a model containing counts.
This method only affects core probability model emissions and transitions.
- Parameters:
scale (
float
) – The scaling factor to use (\(1.0\) for no scaling). Often computed using the ratio of effective sequences (\(\frac{n_{eff}}{n_{seq}}\))exponential (
bool
) – When set toTrue
, usescale
as an exponential factor (\(C_i = C_i^s\)) instead of a multiplicative factor (\(C_i = C_i \times s\)), resulting in a non-uniform scaling across columns. This can be useful when some heavily fragmented sequences are used to reconstruct a family MSA.
New in version 0.4.0.
- set_composition()¶
Calculate and set the model composition.
New in version 0.4.0.
- to_profile(background=None, L=400, multihit=True, local=True)¶
Create a new profile configured for this HMM.
This method is a shortcut for creating a new
Profile
and callingconfigure
for a given HMM. Prefer manually callingconfigure
to recycle the profile buffer when running inside a loop.- Parameters:
background (
Background
, optional) – The null background model. InNone
given, create a default one for the HMM alphabet.L (
int
) – The expected target sequence length.multihit (
bool
) – Whether or not to use multihit mode.local (
bool
) – Whether to use local or global mode.
New in version 0.8.0.
- validate(tolerance=0.0001)¶
Validate the HMM agains the Plan7 structural constraints.
HMMs created with PyHMMER are always valid, whether they come from a
Builder
or from theHMM
constructor. However, it is possible to manually edit the emission scores and transition probabilities, and the structural constrains may not hold. Call this method to make sure the HMM is valid after you are done editing it.- Parameters:
tolerance (
float
) – The tolerance with which to check that the transition probabilities sum to 1.- Raises:
ValueError – When the HMM fails validation.
New in version 0.8.1.
- write(fh, binary=False)¶
Write the HMM to a file handle.
- Parameters:
- zero()¶
Set all parameters to zero, including model composition.
- checksum¶
The 32-bit checksum of the HMM, if any.
The checksum if calculated from the alignment the HMM was created from, and was introduced in more recent HMM formats. This means some
HMM
objects may have a non-None
checksum.New in version 0.2.1.
Changed in version 0.3.1: Returns
None
if the HMM flag for the checksum is not set.
- command_line¶
The command line that built the model.
For HMMs created with
Builder
, this defaults tosys.argv
. It can however be set to any string, including multiline to show successive commands.Example
>>> print(thioesterase.command_line) hmmbuild Thioesterase.hmm Thioesterase.fa hmmcalibrate Thioesterase.hmm
New in version 0.3.1.
- composition¶
The model composition.
May not be available for freshly-created HMMs. To get the mean residue composition emitted by the model, the
set_composition
method must be called to compute the composition from occupancy.Note
Although the allocated buffer in the
P7_HMM
object is always the same dimension so that it can store the largest possible alphabet, we only expose the relevant residues. This means that the vector will be of sizealphabet.K
:>>> dna = easel.Alphabet.dna() >>> dna.K 4 >>> hmm = plan7.HMM(dna, 100, b"test") >>> hmm.set_composition() >>> len(hmm.composition) 4
New in version 0.4.0.
- consensus_accessibility¶
The consensus accessibility of the HMM, if any.
New in version 0.3.1.
- consensus_structure¶
The consensus structure of the HMM, if any.
New in version 0.3.1.
- creation_time¶
The creation time of the HMM, if any.
Example
Get the creation time for any HMM:
>>> thioesterase.creation_time datetime.datetime(2008, 11, 25, 17, 28, 32)
Set the creation time manually to a different date and time:
>>> ctime = datetime.datetime(2021, 8, 23, 23, 59, 19) >>> thioesterase.creation_time = ctime >>> thioesterase.creation_time datetime.datetime(2021, 8, 23, 23, 59, 19)
Danger
Internally,
libhmmer
always usesasctime
to generate a timestamp for the HMMs, so this property assumes that every creation time field can be parsed into adatetime.datetime
object using the"%a %b %d %H:%M:%S %Y"
format.New in version 0.4.6.
- Type:
- evalue_parameters¶
The e-value parameters for this HMM.
- Type:
- insert_emissions¶
The insert emissions of the model.
The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.
Caution
If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1. Consider calling
HMM.validate
after manual edition.New in version 0.3.1.
Changed in version 0.4.0: This property is now a
MatrixF
, and stores row 0.- Type:
- match_emissions¶
The match emissions of the model.
The property exposes a matrix of shape \((M+1, K)\), with one row per node and one column per alphabet symbol.
Note
Since the first row corresponds to the entry probabilities, the emissions are unused. By convention, it should still contain valid probabilities, so it will always be set as follow with 1 probability for the first symbol, and 0 for the rest:
>>> hmm = HMM(easel.Alphabet.dna(), 100, b"test") >>> hmm.match_emissions[0] pyhmmer.easel.VectorF([1.0, 0.0, 0.0, 0.0])
Caution
If editing this matrix manually, note that rows must contain valid probabilities for the HMM to be valid: each row must contains values between 0 and 1, and sum to 1. Consider calling
HMM.validate
after manual edition.New in version 0.3.1.
Changed in version 0.4.0: This property is now a
MatrixF
, and stores row 0.- Type:
- nseq¶
The number of training sequences used, if any.
If the HMM was created from a multiple sequence alignment, this corresponds to the number of sequences in the MSA.
Example
>>> thioesterase.nseq 278
New in version 0.3.1.
- nseq_effective¶
The number of effective sequences used, if any.
New in version 0.3.1.
- reference¶
The reference line from the alignment, if any.
This is relevant if the HMM was built from a multiple sequence alignment (e.g. by
Builder.build_msa
, or by an externalhmmbuild
pipeline run).New in version 0.3.1.
- transition_probabilities¶
The transition probabilities of the model.
The property exposes a matrix of shape \((M+1, 7)\), with one row per node (plus one extra row for the entry probabilities), and one column per transition.
Columns correspond to the following transitions, in order: \(M_n \to M_{n+1}\), \(M_n \to I_n\), \(M_n \to D_{n+1}\), \(I_n \to I_n\), \(I_n \to M_{n+1}\), \(D_n \to D_{n+1}\), \(D_n \to M_{n+1}\). Use the
Transitions
enum instead of hardcoded indices to make your core more legible.Example
>>> t = thioesterase.transition_probabilities >>> t[1, Transitions.MM] 0.999... >>> t[0, Transitions.DM] # 1 by convention for the first node 1.0 >>> t[-1, Transitions.MD] # 0 by convention for the last node 0.0
Caution
If editing this matrix manually, note that some invariants need to hold for the HMM to be valid: \(I_n\), \(M_n\) and \(D_n\) transition probabilities should only contain probabilities between 0 and 1, and sum to 1:
>>> t = thioesterase.transition_probabilities >>> t[50, Transitions.MM] + t[50, Transitions.MI] + t[50, Transitions.MD] 1.000... >>> t[50, Transitions.IM] + t[50, Transitions.II] 1.000... >>> t[50, Transitions.DM] + t[50, Transitions.DD] 1.000...
Consider calling
HMM.validate
after manual edition.New in version 0.3.1.
Changed in version 0.4.0: This property is now a
MatrixF
.- Type:
Profile¶
- class pyhmmer.plan7.Profile¶
A Plan7 search profile.
Changed in version 0.4.6: Added the
evalue_parameters
andcutoffs
attributes.- __init__(M, alphabet)¶
Create a new profile for the given
alphabet
.
- clear()¶
Clear internal buffers to reuse the profile without reallocation.
- configure(hmm, background, L=400, multihit=True, local=True)¶
Configure a search profile using the given models.
- Parameters:
hmm (
HMM
) – The model HMM with core probabilities.background (
Background
) – The null background model.L (
int
) – The expected target sequence length.multihit (
bool
) – Whether or not to use multihit modes.local (
bool
) – Whether or not to use non-local modes.
- copy()¶
Return a copy of the profile with the exact same configuration.
- to_optimized()¶
Convert the profile to a platform-specific optimized profile.
- Returns:
OptimizedProfile
– The platform-specific optimized profile built using the configuration of this profile.
- consensus¶
The consensus residue line of the profile, if set.
New in version 0.4.1.
- consensus_structure¶
The consensus structure of the profile, if any.
New in version 0.4.1.
- cutoffs¶
The bitscore cutoffs for this profile, if any.
- Type:
Cutoffs
- evalue_parameters¶
The e-value parameters for this profile.
- Type:
EvalueParameters
OptimizedProfile¶
- class pyhmmer.plan7.OptimizedProfile¶
An optimized profile that uses platform-specific instructions.
Optimized profiles store the match emissions and transition probabilities for a profile HMM so that they can be loaded in the SIMD code. Typically, matrices use aligned storage so that they can loaded efficiently, and are striped à la Farrar to compute pairwise scores for each sequence residue and profile node.
References
Farrar, Michael. Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
- __init__(M, alphabet)¶
Create a new optimized profile from scratch.
Once allocated, you must call the
convert
method with aProfile
object. It’s actually easier to useProfile.to_optimized
method to obtained a configuredOptimizedProfile
directly, unless you’re explicitly trying to recycle memory.
- convert(profile)¶
Store the given profile into
self
as a platform-specific profile.Use this method to configure an optimized profile from a
Profile
while recycling the internal vector buffers.- Raises:
ValueError – When the optimized profile is too small to hold the profile, or when the standard and the optimized profiles are not compatible.
See also
The
Profile.to_optimized
method, which allows getting anOptimizedProfile
directly from a profile without having to allocate first.
- copy()¶
Create an exact copy of the optimized profile.
Note
The
Alphabet
referenced to by this object is not copied, usecopy.deepcopy
if this is the intended behaviour.
- ssv_filter(seq)¶
Compute the SSV filter score for the given sequence.
- Parameters:
seq (
DigitalSequence
) – The sequence in digital format for which to compute the SSV filter score.- Returns:
Note
- Raises:
AlphabetMismatch – When the alphabet of the sequence does not correspond to the profile alphabet.
Caution
This method is not available on the PowerPC platform (calling it will raise a
NotImplementedError
).New in version 0.4.0.
- write(fh_filter, fh_profile)¶
Write an optimized profile to two separate files.
HMMER implements an acceleration pipeline using several scoring algorithms. Parameters for MSV (the Multi ungapped Segment Viterbi) are saved independently to the
fh_filter
handle, while the rest of the profile is saved tofh_profile
.
- consensus¶
The consensus residue line of the profile, if any.
New in version 0.4.11.
- consensus_structure¶
The consensus structure of the profile, if any.
New in version 0.4.11.
- cutoffs¶
The bitscore cutoffs for this profile, if any.
- Type:
Cutoffs
- evalue_parameters¶
The e-value parameters for this profile.
- Type:
EvalueParameters
- offsets¶
The disk offsets for this optimized profile.
- Type:
Offsets
OptimizedProfileBlock¶
- class pyhmmer.plan7.OptimizedProfileBlock¶
A container for storing
OptimizedProfile
objects.This collections allows the scan loop of
Pipeline
objects to scan several target profiles without having to acquire the GIL for each new model. It does so by synchronizing a Pythonlist
storingOptimizedProfile
objects with a C-contiguous array of pointers to the underlying struct of each object.New in version 0.7.0.
- __init__(*args, **kwargs)¶
- append(optimized_profile)¶
Append an optimized profile at the end of the block.
- clear()¶
Remove all optimized profiles from the block.
- copy()¶
Return a copy of the optimized profile block.
Note
The optimized profiles internally refered to by this collection are not copied. Use
copy.deepcopy
if you also want to duplicate the internal storage of each optimized profile.
- extend(iterable, /)¶
Extend block by appending optimized profiles from the iterable.
- index(optimized_profile, start=0, stop=9223372036854775807)¶
Return the index of the first occurence of
optimized_profile
.- Raises:
ValueError – When the block does not contain
optimized_profile
.
- insert(index, optimized_profile)¶
Insert a new optimized profile in the block before
index
.
- pop(index=-1)¶
Remove and return an optimized profile from the block.
- remove(sequence)¶
Remove the first occurence of the given optimized profile.
HMM Files¶
HMMFile¶
- class pyhmmer.plan7.HMMFile¶
A wrapper around a file (or database), storing serialized HMMs.
You can use this class to iterate on the HMMs inside a file, loading them into
HMM
objects.Example
Load the first HMM from an HMM file located on the local filesystem:
>>> with HMMFile("tests/data/hmms/txt/PF02826.hmm") as hmm_file: ... hmm = hmm_file.read() >>> hmm.name b'2-Hacid_dh_C' >>> hmm.accession b'PF02826.20'
Load all the HMMs from an HMM file into a
list
:>>> with HMMFile("tests/data/hmms/txt/t2pks.hmm") as hmm_file: ... hmms = list(hmm_file) >>> len(hmms) 40 >>> hmms[0].name b'CLF'
- __init__(file, db=True)¶
Create a new HMM reader from the given file.
- close()¶
Close the HMM file and free resources.
This method has no effect if the file is already closed. It is called automatically if the
HMMFile
was used in a context:>>> with HMMFile("tests/data/hmms/bin/PKSI-AT.h3m") as hmm_file: ... hmm = hmm_file.read()
- is_pressed()¶
Check whether the HMM file is a pressed HMM database.
A pressed database is an HMMER format to store optimized profiles in several files on the disk. It can be used to reduce the time needed to process sequences by cutting down the time needed to convert from an
HMM
to anOptimizedProfile
.Example
>>> HMMFile("tests/data/hmms/txt/PKSI-AT.hmm").is_pressed() False >>> HMMFile("tests/data/hmms/bin/PKSI-AT.h3m").is_pressed() False >>> HMMFile("tests/data/hmms/db/PKSI-AT.hmm").is_pressed() True
New in version 0.4.11.
- optimized_profiles()¶
Get an iterator over the
OptimizedProfile
in the HMM database.- Returns:
HMMPressedFile
– An iterator over the optimized profiles in a pressed HMM database.
New in version 0.4.11.
- read()¶
Read the next HMM from the file.
- Returns:
HMM
– The next HMM in the file, orNone
if all HMMs were read from the file already.- Raises:
ValueError – When attempting to read a HMM from a closed file, or when the file could not be parsed.
AllocationError – When memory for the HMM could not be allocated successfully.
New in version 0.4.11.
- rewind()¶
Rewind the file back to the beginning.
HMMPressedFile¶
- class pyhmmer.plan7.HMMPressedFile¶
An iterator over each
OptimizedProfile
in a pressed HMM database.This class cannot be instantiated: use the
HMMFile.optimized_profiles
to obtain an instance from anHMMFile
that wraps a pressed HMM database.Example
Use a pressed HMM database to run a search pipeline using optimized profiles directly, instead of converting them from the text HMMs:
>>> with HMMFile("tests/data/hmms/db/Thioesterase.hmm") as hfile: ... models = hfile.optimized_profiles() ... hits = next(pyhmmer.hmmsearch(models, proteins))
In this example,
hmmsearch
will receive an iterator ofOptimizedProfile
instead of an iterator ofHMM
ishfile
was passed directly; this letshmmsearch
skip the conversion step before running the search pipeline.New in version 0.4.11.
Changed in version 0.7.0: Allow instantiating an
HMMPressedFile
from a filename.- __init__(file)¶
Create a new pressed file from the given filename.
- Parameters:
file (
str
,bytes
oros.PathLike
) – The path to the pressed HMM file containing the optimized profiles to read.
- close()¶
Close the pressed file and free resources.
This method has no effect if the file is already closed. It is called automatically if the
HMMFile
was used in a context:>>> with HMMPressedFile("tests/data/hmms/db/PKSI-AT.hmm") as hmm_db: ... optimized_profile = hmm_db.read()
- read()¶
Read the next optimized profile from the file.
- Returns:
OptimizedProfile
– The next optimized profile in the file, orNone
if all profiles were read from the file already.- Raises:
ValueError – When attempting to read an optimized profile from a closed file, or when the file could not be parsed.
AllocationError – When memory for the
OptimizedProfile
could not be allocated successfully.
New in version 0.4.11.
- rewind()¶
Rewind the file back to the beginning.
- closed¶
Whether the
HMMPressedFile
is closed or not.New in version 0.7.0.
- Type:
- name¶
The path to the HMM file.
Note
Unlike
HMMFile.name
this attribute can never beNone
, since aHMMPressedFile
object can only be created from a filename, and not from a file-like object.New in version 0.7.0.
- Type:
Pipelines¶
Pipeline¶
- class pyhmmer.plan7.Pipeline¶
An HMMER3 accelerated sequence/profile comparison pipeline.
The Plan7 pipeline handles the platform-accelerated comparison of sequences to profile HMMs. It performs either a search (comparing a single query profile to a target sequence database) or a scan (comparing a single query sequence to a target profile database). The two methods are yielding equivalent results: if you have a collection of \(M\) sequences and \(N\) HMMs to compare, doing a search or a scan should give the same raw scores. The E-values will however be different if
Z
anddomZ
where not set manually: \(Z\) will be set to \(M\) for a search, and to \(N\) for a scan.The main reason for which you should choose search or scan is the relative size of the sequences and HMMs databases. In the original HMMER3 code, the memory was managed in a way that you never had to load the entirety of the target sequences in memory. In PyHMMER, the methods accept both reading the target database from a file, or loading it entirely into memory. A scan is always slower than a search because of the overhead introduced when reconfiguring a profile for a new sequence.
- background¶
The null background model to use to compute scores.
- Type:
- randomness¶
The random number generator being used by the pipeline.
- Type:
Changed in version 0.4.2: Added the
randomness
attribute.- __init__(alphabet, background=None, *, bias_filter=True, null2=True, seed=42, Z=None, domZ=None, F1=0.02, F2=0.001, F3=1e-05, E=10.0, T=None, domE=10.0, domT=None, incE=0.01, incT=None, incdomE=0.01, incdomT=None, bit_cutoffs=None)¶
Instantiate and configure a new accelerated comparison pipeline.
- Parameters:
alphabet (
Alphabet
) – The biological alphabet the of the HMMs and sequences that are going to be compared. Used to build the background model.background (
Background
, optional) – The background model to use with the pipeline, orNone
to create and use a default one. The pipeline needs ownership of the background model, so any background model passed there will be copied.
- Keyword Arguments:
bias_filter (
bool
) – Whether or not to enable composition bias filter. Defaults toTrue
.null2 (
bool
) – Whether or not to compute biased composition score corrections. Defaults toTrue
.seed (
int
, optional) – The seed to use with the random number generator. Pass 0 to use a one-time arbitrary seed, orNone
to keep the default seed from HMMER.Z (
int
, optional) – The effective number of comparisons done, for E-value calculation. Leave asNone
to auto-detect by counting the number of sequences queried.domZ (
int
, optional) – The number of significant sequences found, for domain E-value calculation. Leave asNone
to auto-detect by counting the number of sequences reported.F1 (
float
) – The MSV filter threshold.F2 (
float
) – The Viterbi filter threshold.F3 (
float
) – The uncorrected Forward filter threshold.E (
float
) – The per-target E-value threshold for reporting a hit.T (
float
, optional) – The per-target bit score threshold for reporting a hit. If given, takes precedence overE
.domE (
float
) – The per-domain E-value threshold for reporting a domain hit.domT (
float
, optional) – The per-domain bit score threshold for reporint a domain hit. If given, takes precedence overdomE
.incE (
float
) – The per-target E-value threshold for including a hit in the resultingTopHits
.incT (
float
, optional) – The per-target bit score threshold for including a hit in the resultingTopHits
. If given, takes precedence overincE
.incdomE (
float
) – The per-domain E-value threshold for including a domain in the resultingTopHits
.incdomT (
float
, optional) – The per-domain bit score thresholds for including a domain in the resultingTopHits
. If given, takes precedence overincdomE
.bit_cutoffs (
str
, optional) – The model-specific thresholding option to use for reporting hits. WithNone
(the default), use global pipeline options; otherwise pass one of"noise"
,"gathering"
or"trusted"
to use the appropriate cutoffs.
Hint
In order to run the pipeline in slow/max mode, disable the bias filter, and set the three filtering parameters to \(1.0\):
>>> dna = easel.Alphabet.dna() >>> max_pli = Pipeline(dna, bias_filter=False, F1=1.0, F2=1.0, F3=1.0)
Changed in version 0.4.6: Added keywords arguments to set the E-value thresholds.
- arguments()¶
Generate an
argv
-like list with pipeline options as CLI flags.Example
>>> alphabet = easel.Alphabet.amino() >>> plan7.Pipeline(alphabet).arguments() [] >>> plan7.Pipeline(alphabet, F1=0.01).arguments() ['--F1', '0.01']
New in version 0.6.0.
- clear()¶
Reset the pipeline configuration to its default state.
- iterate_hmm(query, sequences, builder=None, select_hits=None)¶
Run the pipeline to find homologous sequences to a query HMM.
- Parameters:
query (
HMM
) – The sequence object to use to query the sequence database.sequences (
DigitalSequenceBlock
) – The target sequences to query with the HMM.builder (
Builder
, optional) – A HMM builder to use to convert the query and subsequent alignments to aHMM
. IfNone
is given, this method will create one with the default parameters.select_hits (callable, optional) – A function or callable object for manually selecting hits during each iteration. It should take a single
TopHits
argument and change the inclusion of every individual hits by setting theincluded
anddropped
flags of eachHit
manually.
- Returns:
IterativeSearch
– An iterator object yielding the hits, sequence alignment, and HMM for each iteration.- Raises:
AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query or database sequences.
Hint
This method corresponds to running
jackhmmer
with thequery
sequence against thesequences
database.Caution
Default values used for
jackhmmer
do not correspond to the default parameters used for creating a pipeline in the other cases. To have truly identical results to thejackhmmer
results in default mode, create thePipeline
object withincE=0.001
andincdomE=0.001
.See also
The
iterate_seq
, which does the same operation with a query sequence instead of a query HMM, and contains more details and examples.Changed in version 0.7.0: Targets must now be inside a
DigitalSequenceBlock
.
- iterate_seq(query, sequences, builder=None, select_hits=None)¶
Run the pipeline to find homologous sequences to a query sequence.
This method implements an iterative search over a database to find all sequences homologous to a query sequence. It is very sensitive to the pipeline inclusion thresholds (
incE
andincdomE
).Since this method returns an iterator, the local results of each iteration will be available for inspection before starting the next one. The
select_hits
callback in particular can be used for manually including/excluding hits in each iteration, which is not supported in the originaljackhmmer
, but available on the HMMER web client.- Parameters:
query (
DigitalSequence
) – The sequence object to use to query the sequence database.sequences (
DigitalSequenceBlock
) – The target sequences to query with the query sequence.builder (
Builder
, optional) – A HMM builder to use to convert the query and subsequent alignments to aHMM
. IfNone
is given, this method will create one with the default parameters.select_hits (callable, optional) – A function or callable object for manually selecting hits during each iteration. It should take a single
TopHits
argument and change the inclusion of individual hits with theHit.include
andHit.drop
methods.
- Returns:
IterativeSearch
– An iterator object yielding the hits, sequence alignment, and HMM for each iteration.- Raises:
AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query or database sequences.
Hint
This method corresponds to running
jackhmmer
with thequery
sequence against thesequences
database.Caution
Default values used for
jackhmmer
do not correspond to the default parameters used for creating a pipeline in the other cases. To have truly identical results to thejackhmmer
results in default mode, create thePipeline
object withincE=0.001
andincdomE=0.001
.Example
Starting from a pipeline and a query sequence, let’s first obtain the iterator over the successive results:
>>> abc = easel.Alphabet.amino() >>> pli = plan7.Pipeline(abc, incE=1e-3, incdomE=1e-3) >>> iterator = pli.iterate_seq(reductase, proteins)
Once this is ready, we can keep iterating until we converge:
>>> converged = False >>> while not converged: ... _, hits, _, converged, _ = next(iterator) ... print(f"Hits: {len(hits)} Converged: {converged}") Hits: 1 Converged: False Hits: 2 Converged: False Hits: 2 Converged: True
To prevent diverging searches from running infinitely, you could wrap the search in a
for
loop instead, using a number of maximum iterations as the upper boundary:>>> iterator = pli.iterate_seq(reductase, proteins) >>> max_iterations = 10 >>> for n in range(max_iterations): ... iteration = next(iterator) ... if iteration.converged: ... break
New in version 0.6.0.
Changed in version 0.7.0: Targets must now be inside a
DigitalSequenceBlock
.
- search_msa(query, sequences, builder=None)¶
Run the pipeline using a query alignment against a sequence database.
- Parameters:
query (
DigitalMSA
) – The multiple sequence alignment to use to query the sequence database.sequences (
DigitalSequenceBlock
orSequenceFile
) – The target sequences to query with the alignment, either pre-loaded in memory inside apyhmmer.easel.DigitalSequenceBlock
, or to be read iteratively from aSequenceFile
opened in digital mode.builder (
Builder
, optional) – A HMM builder to use to convert the query to aHMM
. IfNone
is given, it will use a default one.
- Returns:
TopHits
– the hits found in the sequence database.- Raises:
AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.
ValueError – When the
Builder
fails to create an HMM from the givenDigitalMSA
query.
Hint
This method corresponds to running
phmmer
with thequery
multiple sequence alignment against thesequences
database.Caution
Internally, this method will create a new HMM from the query MSA using the
Builder.build_msa
method. HMMER requires that every HMM has a name, so theBuilder
will attempt to use the name of the query MSA to name the HMM. Passing an MSA without a name will result in an error.New in version 0.3.0.
Changed in version 0.7.0: Targets can be inside a
DigitalSequenceBlock
or aSequenceFile
.
- search_seq(query, sequences, builder=None)¶
Run the pipeline using a query sequence against a sequence database.
- Parameters:
query (
DigitalSequence
) – The sequence object to use to query the sequence database.sequences (
DigitalSequenceBlock
orSequenceFile
) – The target sequences to query with the query sequence, either pre-loaded in memory inside apyhmmer.easel.DigitalSequenceBlock
, or to be read iteratively from aSequenceFile
opened in digital mode.builder (
Builder
, optional) – A HMM builder to use to convert the query to aHMM
. IfNone
is given, it will use a default one.
- Returns:
TopHits
– the hits found in the sequence database.- Raises:
AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.
ValueError – When the
Builder
fails to create an HMM from the givenDigitalSequence
query.
Hint
This method corresponds to running
phmmer
with thequery
sequence against thesequences
database.New in version 0.2.0.
Changed in version 0.7.0: Targets can be inside a
DigitalSequenceBlock
or aSequenceFile
.
- T¶
The per-target score threshold for reporting a hit.
If set to a non-
None
value, this threshold takes precedence over the per-target E-value threshold (Pipeline.E
).New in version 0.4.8.
- Z¶
The number of effective targets searched.
It is used to compute the independent e-value for each domain, and for an entire hit. If
None
, the parameter number will be set automatically after all the comparisons have been done. Otherwise, it can be set to an arbitrary number.
- bit_cutoffs¶
The model-specific thresholding option, if any.
New in version 0.4.6.
- domT¶
The per-domain score threshold for reporting a hit.
If set to a non-
None
value, this threshold takes precedence over the per-domain E-value threshold (Pipeline.domE
).New in version 0.4.8.
- domZ¶
The number of significant targets.
It is used to compute the conditional e-value for each domain. If
None
, the parameter number will be set automatically after all the comparisons have been done, and all hits have been thresholded. Otherwise, it can be set to an arbitrary number.
- incT¶
The per-target score threshold for including a hit.
If set to a non-
None
value, this threshold takes precedence over the per-target E-value inclusion threshold (Pipeline.incE
).New in version 0.4.8.
- incdomT¶
The per-domain score threshold for including a hit.
If set to a non-
None
value, this threshold takes precedence over the per-domain E-value inclusion threshold (Pipeline.incdomE
).New in version 0.4.8.
- scan_seq¶
–
Run the pipeline using a query sequence against a profile database.
- Parameters:
query (
DigitalSequence
) – The sequence object to use to query the profile database.targets (
OptimizedProfileBlock
orHMMPressedFile
) – The optimized profiles to query, either pre-loaded in memory as anOptimizedProfileBlock
, or to be read iteratively from a pressed HMM file.
- Returns:
TopHits
– the hits found in the profile database.- Raises:
AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query or profiles.
Caution
In the current version, this method is not optimized to use the pressed database, even if it exists. This will cause the MSV and SSV filters to be rebuilt at each iteration, which could be slow. Consider at least pre-fetching the HMM database if calling this method several times in a row.
Hint
This method corresponds to running
hmmscan
with thequery
sequence against thehmms
database.New in version 0.4.0.
Changed in version 0.7.0: Require optimized profiles to be inside an
OptimizedProfileBlock
.
- search_hmm¶
–
Run the pipeline using a query HMM against a sequence database.
- Parameters:
query (
HMM
,Profile
orOptimizedProfile
) – The object to use to query the sequence database.sequences (
DigitalSequenceBlock
orSequenceFile
) – The target sequences to query with the HMM, either pre-loaded in memory inside aDigitalSequenceBlock
, or to be read iteratively from aSequenceFile
opened in digital mode.
- Returns:
TopHits
– the hits found in the sequence database.- Raises:
ValueError – When the pipeline is configured to use model-specific reporting thresholds but the
HMM
query doesn’t have the right cutoffs available.AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given HMM.
Hint
This method corresponds to running
hmmsearch
with thequery
HMM against thesequences
database.New in version 0.2.0.
Changed in version 0.4.9: Query can now be a
Profile
or anOptimizedProfile
.Changed in version 0.7.0: Targets can be inside a
DigitalSequenceBlock
or aSequenceFile
.
LongTargetsPipeline¶
- class pyhmmer.plan7.LongTargetsPipeline(Pipeline)¶
An HMMER3 pipeline tuned for long targets.
The default HMMER3 pipeline is configured not to accept target sequences longer than 100,000 residues. Although there is no strong limitation for this threshold, comparing a sequence of \(L\) residues to a profile with \(M\) nodes requires the allocation of a \(L imes M\) dynamic programming matrix.
For sequences too long, it’s actually more efficient memory-wise to use a sliding window to match the profile to the sequence. The usual comparison pipeline is then used to perform the comparison on each window, and results are merged once the entire sequence is done being processed. The context size \(C\) is large enough to accommodate for the entire profile, so that there is no risk of missing a hit in the overlaps between windows. The window size \(W\) can be changed with the
block_length
argument when instantiating a newLongTargetsPipeline
object.New in version 0.4.9.
- __init__(alphabet, background=None, *, F1=0.02, F2=0.003, F3=3e-05, strand=None, B1=100, B2=240, B3=1000, block_length=262144, **kwargs)¶
Instantiate and configure a new long targets pipeline.
- Parameters:
alphabet (
Alphabet
) – The biological alphabet the of the HMMs and sequences that are going to be compared. Used to build the background model. A nucleotide alphabet is expected.background (
Background
, optional) – The background model to use with the pipeline, orNone
to create and use a default one. The pipeline needs ownership of the background model, so any background model passed there will be copied.
- Keyword Arguments:
strand (
str
, optional) – The strand to use when processing nucleotide sequences. Use"watson"
to use only the coding strand,"crick"
to use only the reverse strand, or leave asNone
to process both strands.B1 (
int
) – The window length to use for the biased-composition modifier of the MSV filter.B2 (
int
) – The window length to use for the biased-composition modifier of the Viterbi filter.B3 (
int
) – The window length to use for the biased-composition modifier of the Forward filter.block_length (
int
) – The number of residues to use as the window size \(W\) when reading blocks from the long target sequences.**kwargs – Any additional parameter will be passed to the
Pipeline
constructor.
- arguments()¶
Generate an
argv
-like list with pipeline options as CLI flags.Example
>>> alphabet = easel.Alphabet.dna() >>> plan7.LongTargetsPipeline(alphabet).arguments() [] >>> plan7.LongTargetsPipeline(alphabet, B1=200).arguments() ['--B1', '200']
New in version 0.6.0.
- search_msa(query, sequences, builder=None)¶
Run the pipeline using a query alignment against a sequence database.
- Parameters:
query (
DigitalMSA
) – The multiple sequence alignment to use to query the sequence database.sequences (
DigitalSequenceBlock
) – The target sequences to query with the query alignment.builder (
Builder
, optional) – A HMM builder to use to convert the query to aHMM
. IfNone
is given, it will use a default one.
- Returns:
TopHits
– the hits found in the sequence database.- Raises:
AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.
Hint
This method corresponds to running
nhmmer
with thequery
multiple sequence alignment against thesequences
database.Changed in version 0.7.0: Targets must now be inside a
DigitalSequenceBlock
.
- search_seq(query, sequences, builder=None)¶
Run the pipeline using a query sequence against a sequence database.
- Parameters:
query (
DigitalSequence
) – The sequence object to use to query the sequence database.sequences (
DigitalSequenceBlock
) – The target sequences to query with the query sequence.builder (
Builder
, optional) – A HMM builder to use to convert the query to aHMM
. IfNone
is given, it will use a default one.
- Returns:
TopHits
– the hits found in the sequence database.- Raises:
AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given query.
Hint
This method corresponds to running
nhmmer
with thequery
sequence against thesequences
database.Changed in version 0.7.0: Targets must now be inside a
DigitalSequenceBlock
.
- scan_seq¶
–
Run the pipeline using a query sequence against a profile database.
This is currently unsupported for
LongTargetsPipeline
.
- search_hmm¶
–
Run the pipeline using a query HMM against a sequence database.
- Parameters:
query (
HMM
,Profile
orOptimizedProfile
) – The object to use to query the sequence database.sequences (
DigitalSequenceBlock
) – The target sequences to query with the HMM.
- Returns:
TopHits
– the hits found in the sequence database.- Raises:
ValueError – When the pipeline is configured to use model-specific reporting thresholds but the
HMM
query doesn’t have the right cutoffs available.AlphabetMismatch – When the alphabet of the current pipeline does not match the alphabet of the given HMM.
Hint
This method corresponds to running
nhmmer
with thequery
HMM against thesequences
database.Changed in version 0.7.0: Targets must now be inside a
DigitalSequenceBlock
.
Builder¶
- class pyhmmer.plan7.Builder¶
A factory for constructing new HMMs from raw sequences.
- alphabet¶
The alphabet the builder is configured to use to convert sequences to HMMs.
- Type:
- randomness¶
The random number generator being used by the builder.
- Type:
- score_matrix¶
The name of the substitution matrix used to build HMMs for single sequences.
- Type:
New in version 0.2.0.
Changed in version 0.4.2: Added the
randomness
attribute.- __init__(alphabet, *, architecture='fast', weighting='pb', effective_number='entropy', prior_scheme='alpha', symfrac=0.5, fragthresh=0.5, wid=0.62, esigma=45.0, eid=0.62, EmL=200, EmN=200, EvL=200, EvN=200, EfL=100, EfN=200, Eft=0.04, seed=42, ere=None, popen=None, pextend=None, window_length=None, window_beta=None)¶
Create a new sequence builder with the given configuration.
- Parameters:
alphabet (
Alphabet
) – The alphabet the builder expects the sequences to be in.- Keyword Arguments:
architecture (
str
) – The algorithm to use to determine the model architecture, either"fast"
(the default), or"hand"
.weighting (
str
) – The algorithm to use for relative sequence weighting, either"pb"
(the default),"gsc"
,"blosum"
,"none"
, or"given"
.effective_number (
str
,int
, orfloat
) – The algorithm to use to determine the effective sequence number, either"entropy"
(the default),"exp"
,"clust"
,"none"
. A number can also be given directly to set the effective sequence number manually.prior_scheme (
str
) – The choice of mixture Dirichlet prior when parameterizing from counts, either"laplace"
or"alphabet"
(the default).symfrac (
float
) – The residue occurrence threshold for fast architecture determination.fragthresh (
float
) – A threshold such that a sequence is called a fragment when \(L \le fragthresh imes alen\).wid (
double
) – The percent identity threshold for BLOSUM relative weighting.esigma (
float
) – The minimum total relative entropy parameter for effective number entropy weights.eid (
float
) – The percent identity threshold for effective number clustering.EmL (
int
) – The length of sequences generated for MSV fitting.EmN (
int
) – The number of sequences generated for MSV fitting.EvL (
int
) – The lenght of sequences generated for Viterbi fitting.EvN (
int
) – The number of sequences generated for Viterbi fitting.EfL (
int
) – The lenght of sequences generated for Forward fitting.EfN (
int
) – The number of sequences generated for Forward fitting.Eft (
float
) – The tail mass used for Forward fitting.seed (
int
) – The seed to use to initialize the internal random number generator. If0
is given, an arbitrary seed will be chosen based on the current time.ere (
double
, optional) – The relative entropy target for effective number weighting, orNone
.popen (
float
, optional) – The gap open probability to use when building HMMs from single sequences. The default value depends on the alphabet: 0.02 for proteins, 0.03125 for nucleotides.pextend (
float
, optional) – The gap extend probability to use when building HMMs from single sequences. Default depends on the alphabet: 0.4 for proteins, 0.75 for nucleotides.score_matrix (
str
, optional) – The name of the score matrix to use when building HMMs from single sequences. The only allowed value for nucleotide alphabets is DNA1. For proteins, PAM30, PAM70, PAM120, PAM240, BLOSUM45, BLOSUM50, BLOSUM62 (the default), BLOSUM80 or BLOSUM90 can be used.window_length (
float
, optional) – The window length for nucleotide sequences, essentially the max expected hit length. If given, takes precedence overwindow_beta
.window_beta (
float
, optional) – The tail mass at which window length is determined for nucleotide sequences.
- build(sequence, background)¶
Build a new HMM from
sequence
using the builder configuration.- Parameters:
sequence (
DigitalSequence
) – A single biological sequence in digital mode to build a HMM with.background (
Background
) – The background model to use to create the HMM.
- Returns:
(
HMM
,Profile
,OptimizedProfile
) – A tuple containing the new HMM as well as profiles to be used directly in aPipeline
.- Raises:
AlphabetMismatch – When either
sequence
orbackground
have the wrong alphabet for this builder.ValueError – When the HMM cannot be created successfully from the input; the error message contains more details.
Hint
The score matrix and the gap probabilities used here can be set when initializing the builder, or changed by setting a new value to the right attribute:
>>> alphabet = easel.Alphabet.amino() >>> background = plan7.Background(alphabet) >>> builder = plan7.Builder(alphabet, popen=0.05) >>> builder.score_matrix = "BLOSUM90" >>> hmm, _, _ = builder.build(proteins[0], background)
Changed in version 0.4.6: Sets the
HMM.creation_time
attribute with the current time.Changed in version 0.4.10: The score system is now cached between calls to
Builder.build
.
- build_msa(msa, background)¶
Build a new HMM from
msa
using the builder configuration.- Parameters:
msa (
DigitalMSA
) – A multiple sequence alignment in digital mode to build a HMM with.background (
Background
) – The background model to use to create the HMM.
- Returns:
(
HMM
,Profile
,OptimizedProfile
) – A tuple containing the new HMM as well as profiles to be used directly in aPipeline
.- Raises:
AlphabetMismatch – When either
msa
orbackground
have the wrong alphabet for this builder.ValueError – When the HMM cannot be created successfully from the input; the error message contains more details.
Caution
HMMER requires that every HMM has a name, so the
Builder
will attempt to use the name of the sequence to name the HMM. Passing an MSA without a name will result in an error:>>> alphabet = easel.Alphabet.amino() >>> msa = easel.TextMSA(sequences=[ ... easel.TextSequence(name=b"ustiA", sequence="YAIG"), ... easel.TextSequence(name=b"ustiB", sequence="YVIG") ... ]) >>> builder = plan7.Builder(alphabet) >>> background = plan7.Background(alphabet) >>> builder.build_msa(msa.digitize(alphabet), background) Traceback (most recent call last): ... ValueError: Could not build HMM: Unable to name the HMM.
New in version 0.3.0.
Changed in version 0.4.6: Sets the
HMM.creation_time
attribute with the current time.
- seed¶
The seed given at builder initialization.
Setting this attribute to a different value will cause the internal random number generator to be reseeded immediately.
Changed in version 0.4.2: Avoid shadowing initial null seeds given on builder initialization.
- Type:
Background¶
- class pyhmmer.plan7.Background¶
The null background model of HMMER.
Changed in version 0.4.0: Added the
residue_frequencies
attribute.- __init__(alphabet, uniform=False)¶
Create a new background model for the given
alphabet
.- Parameters:
alphabet (
pyhmmer.easel.Alphabet
) – The alphabet to create the background model with.uniform (
bool
) – Whether or not to create the null model with uniform frequencies. Defaults toFalse
.
- copy()¶
Create a copy of the null model with the same parameters.
Results¶
TopHits¶
- class pyhmmer.plan7.TopHits¶
An immutable ranked list of top-scoring hits.
TopHits
are thresholded using the parameters from the pipeline, and are sorted by key when you obtain them from aPipeline
instance:>>> abc = thioesterase.alphabet >>> hits = Pipeline(abc).search_hmm(thioesterase, proteins) >>> hits.is_sorted(by="key") True
Use
len
to query the number of top hits, and the usual indexing notation to extract a particularHit
:>>> len(hits) 1 >>> hits[0].name b'938293.PRJEB85.HG003687_113'
New in version 0.6.1:
pickle
protocol support.- compare_ranking(ranking)¶
Compare current top hits to previous top hits ranking.
This method is used by
jackhmmer
to record the hits obtained during each iteration, so that the inner loop can converge.- Parameters:
ranking (
KeyHash
) – A keyhash containing the ranks of the top hits from a previous run.- Returns:
int
– The number of new hits found in this iteration.
New in version 0.6.0.
- is_sorted(by='key')¶
Check whether or not the hits are sorted with the given method.
See
sort
for a list of allowed values for theby
argument.
- merge(*others)¶
Concatenate the hits from this instance and
others
.If the
Z
anddomZ
values used to compute E-values were computed by thePipeline
from the number of targets, the returned object will update them by summingself.Z
andother.Z
. If they were set manually, the manual value will be kept, provided both values are equal.- Returns:
TopHits
– A new collection of hits containing a copy of all the hits fromself
andother
, sorted by key.- Raises:
ValueError – When trying to merge together several hits obtained from different
Pipeline
with incompatible parameters.
Caution
This should only be done for hits obtained for the same domain on similarly configured pipelines. Some internal checks will be done to ensure this is not the case, but the results may not be consistent at all.
Example
>>> pli = Pipeline(thioesterase.alphabet) >>> hits1 = pli.search_hmm(thioesterase, proteins[:1000]) >>> hits2 = pli.search_hmm(thioesterase, proteins[1000:2000]) >>> hits3 = pli.search_hmm(thioesterase, proteins[2000:]) >>> merged = hits1.merge(hits2, hits3)
New in version 0.5.0.
- sort(by='key')¶
Sort hits in the current instance using the given method.
- Parameters:
by (
str
) – The comparison method to use to compare hits. Allowed values are:key
(the default) to sort by key, orseqidx
to sort by sequence index and alignment position.
- to_msa(alphabet, *, sequences=None, traces=None, trim=False, digitize=False, all_consensus_cols=False)¶
Create multiple alignment of all included domains.
- Parameters:
alphabet (
Alphabet
) – The alphabet of the HMM thisTopHits
was obtained from. It is required to convert back hits to single sequences.sequences (
list
ofSequence
, optional) – A list of additional sequences to include in the alignment.traces (
list
ofTrace
, optional) – A list of additional traces to include in the alignment.
- Keyword Arguments:
trim (
bool
) – Trim off any residues that get assigned to flanking \(N\) and \(C\) states (in profile traces) or \(I_0\) and \(I_m\) (in core traces).digitize (
bool
) – If set toTrue
, returns aDigitalMSA
instead of aTextMSA
.all_consensus_cols (
bool
) – Force a column to be created for every consensus column in the model, even if it means having all gap character in a column.
- Returns:
MSA
– A multiple sequence alignment containing the reported hits, either aTextMSA
or aDigitalMSA
depending on the value of thedigitize
argument.
New in version 0.3.0.
Changed in version 0.6.0: Added the
sequences
andtraces
arguments.
- write(fh, format='targets', header=True)¶
Write the hits in tabular format to a file-like object.
- Parameters:
Hint
The hits can be written in one of the following formats:
targets
A tabular output format of per-target hits, as obtained with the
--tblout
output flag ofhmmsearch
orhmmscan
.domains
A tabular output format of per-domain hits, as obtained with the
--domtblout
output flag ofhmmsearch
orhmmscan
.pfam
A tabular output format suitable for Pfam, merging per-sequence and per-domain hits in a single file, with fewer fields and sorted by score.
New in version 0.6.1.
- bit_cutoffs¶
The model-specific thresholding option, if any.
New in version 0.5.0.
- block_length¶
The block length these hits were obtained with.
Is always
None
when the hits were not obtained from a long targets pipeline.New in version 0.5.0.
- incdomT¶
The per-domain score threshold for including a hit.
New in version 0.5.0.
- long_targets¶
Whether these hits were produced by a long targets pipeline.
New in version 0.5.0.
- Type:
Hit¶
- class pyhmmer.plan7.Hit¶
A high-scoring database hit found by the comparison pipeline.
A hit is obtained in HMMER for every target where one or more significant domain alignment was found by a
Pipeline
. AHit
comes with a score, which is obtained after correcting of the individual bit scores of all its domains; a P-value, which is computed by testing the likelihood to obtain the same alignment using a random background model; and an E-value, which is obtained after Bonferonni correction of the p-value, taking into account the total number of targets in the target database.Hits also store several information as flags.
Hit.included
andHit.reported
show whether aHit
is considered for inclusion (resp. reporting) with respects to the thresholds defined on the originalPipeline
. These flags can be modified manually to force inclusion or exclusion of certains hits independently of their score or E-value. Thewrite
method ofTopHits
objects will only write a line for hits marked as reported. Included hits are necessarily reported:\[\text{included} \implies \text{reported}\]When used during an iterative search, hits can also be marked as dropped by setting the
Hit.dropped
flag toFalse
. Dropped hits will not be used for building HMMs during the next iteration. Hits newly found in an iteration will be marked as new with theHit.new
flag.Hit.dropped
andHit.included
are mutually exclusive, and setting one will unset the other. Dropped hits can be reported, but are not included:\[\text{dropped} \implies \neg \text{included}\]When running a long target pipeline, some hits may appear as duplicates if they were found across multiple windows. These hits will be marked as duplicates with the
Hit.duplicate
flag. Duplicate hits are neither reported nor included:\[\text{duplicate} \implies \neg \text{reported}\]New in version 0.6.1:
pickle
protocol support.
Domains¶
Domain¶
Alignment¶
Traces¶
TraceAligner¶
- class pyhmmer.plan7.TraceAligner¶
A factory for aligning several sequences to a reference model.
Example
>>> aligner = TraceAligner() >>> traces = aligner.compute_traces(thioesterase, proteins[:100]) >>> msa = aligner.align_traces(thioesterase, proteins[:100], traces)
New in version 0.4.7.
- align_traces(hmm, sequences, traces, trim=False, digitize=False, all_consensus_cols=False)¶
Compute traces for a collection of sequences relative to an HMM.
- Parameters:
hmm (
HMM
) – The reference HMM to use for the alignment.sequences (
DigitalSequenceBlock
) – The target sequences to align to the HMM.traces (
Traces
) – The traces corresponding to the alignment ofsequences
tohmm
, obtained by a previous call tocompute_traces
.digitize (
bool
) – If set toTrue
, returns aDigitalMSA
instead of aTextMSA
.trim (
bool
) – Trim off any residues that get assigned to flanking \(N\) and \(C\) states (in profile traces) or \(I_0\) and \(I_m\) (in core traces).all_consensus_cols (
bool
) – Force a column to be created for every consensus column in the model, even if it means having all gap character in a column. Note that this is enabled by default forhmmalign
since HMMER v3.1, but disabled here.
- Returns:
MSA
– A multiple sequence alignment containing the aligned sequences, either aTextMSA
or aDigitalMSA
depending on the value of thedigitize
argument.- Raises:
AlphabetMismatch – when the alphabet of any of the sequences does not correspond to the HMM alphabet.
Changed in version 0.7.0: Targets must now be inside a
DigitalSequenceBlock
.
- compute_traces(hmm, sequences)¶
Compute traces for a collection of sequences relative to an HMM.
- Parameters:
hmm (
HMM
) – The reference HMM to use for the alignment.sequences (
DigitalSequenceBlock
) – The target sequences to align to the HMM.
- Returns:
Traces
– The traces for each sequence.- Raises:
AlphabetMismatch – when the alphabet of any of the sequences does not correspond to the HMM alphabet.
Changed in version 0.7.0: Targets must now be inside a
DigitalSequenceBlock
.
Traces¶
- class pyhmmer.plan7.Traces¶
A list of tracebacks obtained by aligning several sequences to a model.
New in version 0.4.7.
Trace¶
- class pyhmmer.plan7.Trace¶
A traceback for the alignment of a model to a sequence.
New in version 0.4.7.
- __init__(posteriors=False)¶
Create a new
Trace
object.- Parameters:
posteriors (
bool
) – Whether or not to allocate additional memory for the storage of posterior probabilties.
- expected_accuracy()¶
Returns the sum of the posterior residue decoding probabilities.
- from_sequence(sequence)¶
Create a faux trace from a single sequence.
New in version 0.6.0.
Iterative Searches¶
IterativeSearch¶
- class pyhmmer.plan7.IterativeSearch¶
A helper class for running iterative queries like JackHMMER.
See
Pipeline.iterate_seq
andPipeline.iterate_hmm
for more information.- builder¶
The builder object for converting multiple sequence alignments obtained after each run to a
HMM
.- Type:
- query¶
The query object to use for the first iteration.
- Type:
- targets¶
The targets sequences to search for homologs.
- Type:
- Yields:
IterationResult
– A named tuple containing the hits, multiple sequence alignment and HMM for each iteration, as well as the iteration index and a flag marking whether the search converged.
References
Johnson, Steven L., Eddy, Sean R. & Portugaly, Elon. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 11, 431 (18 August 2010). doi:10.1186/1471-2105-11-431.
- __init__(*args, **kwargs)¶
IterationResult¶
- class pyhmmer.plan7.IterationResult¶
The results of a single iteration from an
IterativeSearch
.- msa¶
A multiple sequence alignment containing the hits from this iteration.
- Type:
- converged¶
A flag marking whether this iteration converged (no new hit found in the target sequences with respect to the pipeline inclusion thresholds).
- Type:
Miscellaneous¶
Cutoffs¶
- class pyhmmer.plan7.Cutoffs¶
A mutable view over the score cutoffs of a
HMM
or aProfile
.New in version 0.4.6.
- gathering_available()¶
Check whether the gathering thresholds are available.
- noise_available()¶
Check whether the noise cutoffs are available.
- trusted_available()¶
Check whether the trusted cutoffs are available.
- gathering¶
The gathering thresholds, if any.
Example
This property can be used to set the gathering cutoffs by passing it an iterable of two
float
:>>> thioesterase.cutoffs.gathering = (180.0, 120.0) >>> thioesterase.cutoffs.gathering_available() True >>> thioesterase.cutoffs.gathering (180.0, 120.0)
Set the attribute to
None
or delete it withdel
to clear the gathering thresholds:>>> thioesterase.cutoffs.gathering = None >>> thioesterase.cutoffs.gathering_available() False
New in version 0.4.8.
EvalueParameters¶
- class pyhmmer.plan7.EvalueParameters¶
A mutable view over the e-value parameters of a
HMM
or aProfile
.The e-value for each filter is estimated based off a maximum likelihood distribution fitted for each profile HMM, either a Gumbel distribution for the MSV and Viterbi filters, or an exponential distribution for the Forward filter.
New in version 0.4.6.
Offsets¶
- class pyhmmer.plan7.Offsets¶
A mutable view over the disk offsets of a profile.
Transitions¶
- class pyhmmer.plan7.Transitions(enum.IntEnum)¶
A helper enum for indices of the HMM transition probability matrix.
The Plan 7 model architecture used in HMMER describes a HMM which has 3 states and 7 transitions (hence the name) for every node of the model. The HMM can transition from a match state (\(M_n\)) to the next match stage (\(M_{n+1}\)), to an insertion state (\(I_n\)) or a deletion state (\(D_{n+1}\)). However, there are no transitions between a deletion and insertion state.
New in version 0.8.1.
- MM = 0¶
- MI = 1¶
- MD = 2¶
- IM = 3¶
- II = 4¶
- DM = 5¶
- DD = 6¶