Results#

class pyhmmer.plan7.TopHits#

An immutable ranked list of top-scoring hits.

TopHits are thresholded using the parameters from the pipeline, and are sorted by key when you obtain them from a Pipeline instance:

>>> abc = thioesterase.alphabet
>>> hits = Pipeline(abc).search_hmm(thioesterase, proteins)
>>> hits.is_sorted(by="key")
True

Use len to query the number of top hits, and the usual indexing notation to extract a particular Hit:

>>> len(hits)
1
>>> hits[0].name
b'938293.PRJEB85.HG003687_113'

Added in version 0.6.1: pickle protocol support.

__init__(query)#

Create an empty TopHits instance.

compare_ranking(ranking)#

Compare current top hits to previous top hits ranking.

This method is used by jackhmmer to record the hits obtained during each iteration, so that the inner loop can converge.

Parameters:

ranking (KeyHash) – A keyhash containing the ranks of the top hits from a previous run.

Returns:

int – The number of new hits found in this iteration.

Added in version 0.6.0.

copy()#

Create a copy of this TopHits instance.

Added in version 0.5.0.

is_sorted(by='key')#

Check whether or not the hits are sorted with the given method.

See sort for a list of allowed values for the by argument.

merge(*others)#

Concatenate the hits from this instance and others.

If the Z and domZ values used to compute E-values were computed by the Pipeline from the number of targets, the returned object will update them by summing self.Z and other.Z. If they were set manually, the manual value will be kept, provided both values are equal.

Returns:

TopHits – A new collection of hits containing a copy of all the hits from self and others, sorted by key.

Raises:

ValueError – When trying to merge hits obtained from different Pipeline instances with incompatible parameters.

Caution

This should only be done for hits obtained for the same domain on similarly configured pipelines. Some internal checks are done to ensure this is the case, but they are not exhaustive, and the results may be inconsistent if it is not.

Example

>>> pli = Pipeline(thioesterase.alphabet)
>>> hits1 = pli.search_hmm(thioesterase, proteins[:1000])
>>> hits2 = pli.search_hmm(thioesterase, proteins[1000:2000])
>>> hits3 = pli.search_hmm(thioesterase, proteins[2000:])
>>> merged = hits1.merge(hits2, hits3)

Added in version 0.5.0.
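The rule for combining Z values described above can be sketched in plain Python. This is an illustrative model only, not PyHMMER's implementation: merge_z and its z_is_manual argument are hypothetical names, and raising ValueError when manual values differ is an assumption based on the "provided both values are equal" condition.

```python
def merge_z(z_values, z_is_manual):
    """Combine the Z values of several result sets (illustrative model only).

    z_values: the Z of each result set being merged.
    z_is_manual: hypothetical flag, True if Z was set by hand on the
        pipeline, False if it was derived from the number of targets.
    """
    if not z_values:
        raise ValueError("nothing to merge")
    if not z_is_manual:
        # Each Z counted the targets of one search, so the merged Z is the total.
        return float(sum(z_values))
    # A manual Z is kept as-is, but only if every result set agrees on it.
    first = z_values[0]
    if any(z != first for z in z_values[1:]):
        # Assumption: differing manual values count as incompatible parameters.
        raise ValueError("cannot merge hits with different manual Z values")
    return float(first)
```

Under this model, merging three searches over 1000, 1000 and 500 targets gives a Z of 2500, while two searches that both used a manual Z of 10000 keep that value.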

sort(by='key')#

Sort hits in the current instance using the given method.

Parameters:

by (str) – The comparison method to use to compare hits. Allowed values are: key (the default) to sort by key, or seqidx to sort by sequence index and alignment position.

to_msa(alphabet, sequences=None, traces=None, trim=False, digitize=False, all_consensus_cols=False)#

Create multiple alignment of all included domains.

Parameters:
  • alphabet (Alphabet) – The alphabet of the HMM this TopHits was obtained from. It is required to convert hits back to single sequences.

  • sequences (list of Sequence, optional) – A list of additional sequences to include in the alignment.

  • traces (list of Trace, optional) – A list of additional traces to include in the alignment.

Keyword Arguments:
  • trim (bool) – Trim off any residues that get assigned to flanking \(N\) and \(C\) states (in profile traces) or \(I_0\) and \(I_m\) (in core traces).

  • digitize (bool) – If set to True, returns a DigitalMSA instead of a TextMSA.

  • all_consensus_cols (bool) – Force a column to be created for every consensus column in the model, even if this produces columns that contain only gap characters.

Returns:

MSA – A multiple sequence alignment containing the reported hits, either a TextMSA or a DigitalMSA depending on the value of the digitize argument.

Added in version 0.3.0.

Changed in version 0.6.0: Added the sequences and traces arguments.

write(fh, format='targets', header=True)#

Write the hits in tabular format to a file-like object.

Parameters:
  • fh (io.IOBase) – A Python file handle, opened in binary mode.

  • format (str) – The tabular format in which to write the hits.

  • header (bool) – Whether to write a table header. Ignored when writing in the pfam format.

Hint

The hits can be written in one of the following formats:

targets

A tabular output format of per-target hits, as obtained with the --tblout output flag of hmmsearch or hmmscan.

domains

A tabular output format of per-domain hits, as obtained with the --domtblout output flag of hmmsearch or hmmscan.

pfam

A tabular output format suitable for Pfam, merging per-sequence and per-domain hits in a single file, with fewer fields and sorted by score.

Added in version 0.6.1.
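Because write expects a binary file handle, an in-memory io.BytesIO buffer is a convenient way to capture the table without touching the disk. The helper below is a minimal sketch: hits_to_table is a hypothetical name, and it relies only on the write(fh, format, header) signature documented above.

```python
import io


def hits_to_table(hits, format="targets", header=True):
    """Return the table produced by ``hits.write`` as a bytes object."""
    buffer = io.BytesIO()  # binary file-like object, as write() requires
    hits.write(buffer, format=format, header=header)
    return buffer.getvalue()
```

With hits obtained from a Pipeline, hits_to_table(hits, format="domains") would return the per-domain table, which can then be decoded or passed to a parser.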

E#

The per-target E-value threshold for reporting a hit.

Added in version 0.5.0.

Type:

float

T#

The per-target score threshold for reporting a hit.

Added in version 0.5.0.

Type:

float or None

Z#

The effective number of targets searched.

Type:

float

bit_cutoffs#

The model-specific thresholding option, if any.

Added in version 0.5.0.

Type:

str or None

block_length#

The block length these hits were obtained with.

Is always None when the hits were not obtained from a long targets pipeline.

Added in version 0.5.0.

Type:

int or None

domE#

The per-domain E-value threshold for reporting a hit.

Added in version 0.5.0.

Type:

float

domT#

The per-domain score threshold for reporting a hit.

Added in version 0.5.0.

Type:

float or None

domZ#

The effective number of significant targets searched.

Type:

float

incE#

The per-target E-value threshold for including a hit.

Added in version 0.5.0.

Type:

float

incT#

The per-target score threshold for including a hit.

Added in version 0.4.8.

Type:

float or None

incdomE#

The per-domain E-value threshold for including a hit.

Added in version 0.5.0.

Type:

float

incdomT#

The per-domain score threshold for including a hit.

Added in version 0.5.0.

Type:

float or None

included#

An iterator over the hits marked as included.

Added in version 0.7.0.

Type:

iterator of Hit

long_targets#

Whether these hits were produced by a long targets pipeline.

Added in version 0.5.0.

Type:

bool

mode#

Whether the hits were obtained in scan or search mode.

Added in version 0.9.0.

Type:

str

query#

The query object these hits were obtained for.

The actual type of TopHits.query depends on the query that was given to the Pipeline, or the hmmer function, that created the object:

>>> hits = next(pyhmmer.hmmsearch(thioesterase, proteins))
>>> hits.query is thioesterase
True

Type:

object

query_accession#

The accession of the query, if any.

Added in version 0.6.1.

Deprecated since version 0.10.10: Use TopHits.query to access the original query directly.

Type:

bytes or None

query_length#

The length of the query.

Added in version 0.10.5.

Deprecated since version 0.10.10: Use TopHits.query to access the original query directly.

Type:

int

query_name#

The name of the query, if any.

Added in version 0.6.1.

Deprecated since version 0.10.10: Use TopHits.query to access the original query directly.

Type:

bytes or None

reported#

An iterator over the hits marked as reported.

Added in version 0.7.0.

Type:

iterator of Hit
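The reported and included iterators can be combined with len to get a quick overview of a result set. The helper below is a sketch under the assumption that it receives a TopHits-like object; summarize_hits is a hypothetical name, and it uses only len(hits), hits.reported and hits.included.

```python
def summarize_hits(hits):
    """Count the hits in a TopHits-like object by flag (illustrative helper)."""
    return {
        "total": len(hits),
        # The properties are iterators, so count their items by consuming them.
        "reported": sum(1 for _ in hits.reported),
        "included": sum(1 for _ in hits.included),
    }
```

Since included hits are necessarily reported, the included count should never exceed the reported count.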

searched_models#

The number of models searched.

Added in version 0.5.0.

Type:

int

searched_nodes#

The number of model nodes searched.

Added in version 0.5.0.

Type:

int

searched_residues#

The number of residues searched.

Added in version 0.5.0.

Type:

int

searched_sequences#

The number of sequences searched.

Added in version 0.5.0.

Type:

int

strand#

The strand these hits were obtained from.

Is always None when the hits were not obtained from a long targets pipeline, or when the long targets pipeline was configured to search both strands.

Added in version 0.5.0.

Type:

str or None

class pyhmmer.plan7.Hit#

A high-scoring database hit found by the comparison pipeline.

A hit is obtained in HMMER for every target where one or more significant domain alignments were found by a Pipeline. A Hit comes with a score, which is obtained after correction of the individual bit scores of all its domains; a P-value, which is computed by testing the likelihood of obtaining the same alignment using a random background model; and an E-value, which is obtained after Bonferroni correction of the P-value, taking into account the total number of targets in the target database.
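As a worked illustration of the last step, the correction amounts to multiplying the P-value by the number of targets searched (the Z value of the TopHits). The function below is a minimal sketch of that relationship; evalue_from_pvalue is a hypothetical name and the code does not call PyHMMER.

```python
def evalue_from_pvalue(pvalue, z):
    """Expected number of targets scoring this well by chance among z targets."""
    if not 0.0 <= pvalue <= 1.0:
        raise ValueError("a P-value must lie between 0 and 1")
    if z < 0:
        raise ValueError("the number of targets cannot be negative")
    # The product is not capped at 1, so an E-value can exceed 1.
    return pvalue * z
```

For instance, a hit with a P-value of 1e-6 corresponds to an E-value of about 1e-3 in a database of 1000 targets, and to about 1 in a database of a million targets.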

Hits also store several pieces of information as flags. Hit.included and Hit.reported show whether a Hit is considered for inclusion (resp. reporting) with respect to the thresholds defined on the original Pipeline. These flags can be modified manually to force inclusion or exclusion of certain hits independently of their score or E-value. The write method of TopHits objects will only write a line for hits marked as reported. Included hits are necessarily reported:

\[\text{included} \implies \text{reported}\]

When used during an iterative search, hits can also be marked as dropped by setting the Hit.dropped flag to True. Dropped hits will not be used for building HMMs during the next iteration. Hits newly found in an iteration will be marked as new with the Hit.new flag. Hit.dropped and Hit.included are mutually exclusive, and setting one will unset the other. Dropped hits can be reported, but are not included:

\[\text{dropped} \implies \neg \text{included}\]

When running a long target pipeline, some hits may appear as duplicates if they were found across multiple windows. These hits will be marked as duplicates with the Hit.duplicate flag. Duplicate hits are neither reported nor included:

\[\text{duplicate} \implies \neg \text{reported}\]
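The three implications above can be checked together for any combination of flags. The function below is an illustrative sketch in plain Python; flags_consistent is a hypothetical name, and it encodes only the rules stated in this section.

```python
def flags_consistent(included, reported, dropped, duplicate):
    """Return True if the flags satisfy the three implications stated above."""
    return (
        (not included or reported)            # included  => reported
        and (not dropped or not included)     # dropped   => not included
        and (not duplicate or not reported)   # duplicate => not reported
    )
```

For a Hit named hit, the check would read flags_consistent(hit.included, hit.reported, hit.dropped, hit.duplicate). A dropped hit that is still reported passes the check, while a hit that is both duplicate and reported does not.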

Added in version 0.6.1: pickle protocol support.

accession#

The accession of the database hit, if any.

Type:

bytes or None

best_domain#

The best scoring domain in this hit.

Added in version 0.4.2.

Type:

Domain

bias#

The null2 contribution to the uncorrected score.

Type:

float

description#

The description of the database hit, if any.

Type:

bytes or None

domains#

The list of domains aligned to this hit.

Type:

Domains

dropped#

Whether this hit is marked as dropped.

Added in version 0.7.0.

Type:

bool

duplicate#

Whether this hit is marked as duplicate.

Added in version 0.7.0.

Type:

bool

evalue#

The e-value of the hit.

Type:

float

included#

Whether this hit is marked as included.

Added in version 0.7.0.

Type:

bool

length#

The length of the database hit.

Added in version 0.10.5.

Type:

int

name#

The name of the database hit.

Type:

bytes

new#

Whether this hit is marked as new.

Added in version 0.7.0.

Type:

bool

pre_score#

Bit score of the sequence before null2 correction.

Type:

float

pvalue#

The p-value of the bitscore.

Added in version 0.4.2.

Type:

float

reported#

Whether this hit is marked as reported.

Added in version 0.7.0.

Type:

bool

score#

Bit score of the sequence with all domains after correction.

Type:

float

sum_score#

Bit score reconstructed from the sum of domain envelopes.

Added in version 0.4.6.

Type:

float

class pyhmmer.plan7.Domains#

A read-only view over the domains of a single Hit.

hit#

The target hit these domains belong to.

Type:

Hit

included#

An iterator over included domains only.

Added in version 0.7.0.

Type:

iterator of Domain

reported#

An iterator over reported domains only.

Added in version 0.7.0.

Type:

iterator of Domain

class pyhmmer.plan7.Domain#

A single domain in a query Hit.

hit#

The hit this domain is part of.

Type:

Hit

alignment#

The alignment of this domain to a target sequence.

Type:

Alignment

Added in version 0.6.1: pickle protocol support.

bias#

The null2 score contribution to the domain score.

Type:

float

c_evalue#

The conditional e-value for the domain.

Type:

float

correction#

The null2 score when calculating a per-domain score.

Type:

float

env_from#

The start coordinate of the domain envelope.

Type:

int

env_to#

The end coordinate of the domain envelope.

Type:

int

envelope_score#

The forward score in the envelope, without null2 correction.

Type:

float

i_evalue#

The independent e-value for the domain.

Type:

float

included#

Whether this domain is marked as included.

Added in version 0.7.0.

Type:

bool

pvalue#

The p-value of the domain bitscore.

Type:

float

reported#

Whether this domain is marked as reported.

Added in version 0.7.0.

Type:

bool

score#

The overall score in bits, null2-corrected.

Type:

float

strand#

The strand where the domain is located.

When running a search with the LongTargetsPipeline, both strands of each target sequence are processed (unless disabled), so the domain may be located on either strand, either + or -. For default Pipeline searches, this is always None.

Added in version 0.10.8.

Type:

str or None

class pyhmmer.plan7.Alignment#

An alignment of a sequence to a profile.

domain#

The domain this alignment corresponds to.

Type:

Domain

Added in version 0.6.1: pickle protocol support.

hmm_accession#

The accession of the query, or its name if it has none.

Added in version 0.1.4.

Type:

bytes

hmm_from#

The start coordinate of the alignment in the query HMM.

Type:

int

hmm_length#

The length of the query HMM in the alignment.

Added in version 0.10.5.

Type:

int

hmm_name#

The name of the query HMM.

Type:

bytes

hmm_sequence#

The sequence of the query HMM in the alignment.

Type:

str

hmm_to#

The end coordinate of the alignment in the query HMM.

Type:

int

identity_sequence#

The identity sequence between the query and the target.

Type:

str

posterior_probabilities#

Posterior probability annotation of the alignment.

Added in version 0.10.5.

Type:

str

target_from#

The start coordinate of the alignment in the target sequence.

Type:

int

target_length#

The length of the target sequence in the alignment.

Added in version 0.10.5.

Type:

int

target_name#

The name of the target sequence.

Type:

bytes

target_sequence#

The target sequence in the alignment.

Type:

str

target_to#

The end coordinate of the alignment in the target sequence.

Type:

int