HMMER
Reimplementation of HMMER binaries with the PyHMMER API.
Note
Functions of this module handle parallelization using threads to run searches in parallel for the different queries. If fewer queries than threads are given, the number of threads is reduced to avoid spawning idle threads.
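The thread reduction described in this note can be pictured with a minimal sketch. The helper name effective_threads is hypothetical and this is not PyHMMER's actual implementation; os.cpu_count stands in for the automatic selection, only to illustrate capping the worker count by the number of queries.

```python
import os

def effective_threads(cpus: int, n_queries: int) -> int:
    """Illustrative only: cap the number of worker threads by the number of queries."""
    if cpus <= 0:
        # 0 means "choose automatically"; os.cpu_count() may return None
        cpus = os.cpu_count() or 1
    # never use more threads than queries, and always keep at least one
    return max(1, min(cpus, n_queries))

print(effective_threads(8, 3))   # 3: only three queries, so three threads
print(effective_threads(2, 10))  # 2: the requested count is already small enough
```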
hmmsearch
- pyhmmer.hmmer.hmmsearch(queries, sequences, cpus=0, callback=None, **options)
Search HMM profiles against a sequence database.
In HMMER many-to-many comparisons, a search is the operation of querying a database of sequences with profile HMMs.
The hmmsearch function offers two ways of managing the database, selected based on the type of the sequences argument. If sequences is a SequenceFile object, hmmsearch will reopen the file in each thread and load targets iteratively to scan with the query. Otherwise, it will pre-fetch the target sequences into a DigitalSequenceBlock collection and share them across threads without copying. Pre-fetching gives much higher performance at the cost of extra startup time and much higher memory consumption. You may want to check how much memory is available (for instance with psutil.virtual_memory) before trying to load a whole sequence database, but pre-fetching is strongly recommended whenever possible.
- Parameters:
queries (iterable of HMM, Profile or OptimizedProfile) – The query HMMs or profiles to search for in the database. Note that passing a single object is supported.
sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. If a SequenceFile is given, sequences will be loaded iteratively from disk rather than prefetched.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.
callback (callable) – A callback that is called every time a query is processed with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.
- Yields:
TopHits – An object reporting top hits for each query, in the same order the queries were passed in the input.
- Raises:
AlphabetMismatch – When any of the query HMMs and the sequences do not share the same alphabet.
Note
Any additional arguments passed to the hmmsearch function will be passed transparently to the Pipeline to be created. For instance, to run a hmmsearch using a bitscore cutoff of 5 instead of the default E-value cutoff, use:
>>> hits = next(hmmsearch(thioesterase, proteins, T=5))
>>> hits[0].score
8.601...
New in version 0.1.0.
Changed in version 0.4.9: Allow using Profile and OptimizedProfile queries.
Changed in version 0.7.0: Queries may now be an iterable of different types, or a single object.
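The trade-off between pre-fetching and iterative loading can be turned into a simple size check before choosing how to pass the sequences argument. The sketch below is only an illustration: the helper name should_prefetch and the overhead factor are assumptions, not part of PyHMMER, and the in-memory size of a loaded database may differ substantially from its size on disk.

```python
import os

def should_prefetch(db_path: str, available_bytes: int, overhead: float = 2.0) -> bool:
    """Return True if a database file of this size plausibly fits in memory.

    `overhead` is an assumed safety factor, not a value taken from PyHMMER.
    """
    return os.path.getsize(db_path) * overhead <= available_bytes

# Usage sketch (assumes psutil is installed; the file name is hypothetical):
#   import psutil
#   available = psutil.virtual_memory().available
#   if should_prefetch("proteins.faa", available):
#       ...  # load all sequences into a block and pass the block to hmmsearch
#   else:
#       ...  # pass the open SequenceFile so targets are read iteratively
```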
hmmscan
- pyhmmer.hmmer.hmmscan(queries, profiles, cpus=0, callback=None, background=None, **options)
Scan query sequences against a profile database.
In HMMER many-to-many comparisons, a scan is the operation of querying a database of profile HMMs with sequences. It is necessarily slower than a search because reconfiguring profiles between queries has additional overhead, so it’s recommended to use a search if the order of the comparisons is not important.
The hmmscan function offers two ways of managing the database, selected based on the type of the profiles argument. If profiles is an HMMPressedFile object, hmmscan will reopen the file in each thread and load profiles iteratively to scan with the query. Otherwise, it will pre-fetch the optimized profiles into an OptimizedProfileBlock collection and share them across queries. Pre-fetching gives much higher performance at the cost of extra startup time and much higher memory consumption. You may want to check how much memory is available (for instance with psutil.virtual_memory) before trying to load a whole pHMM database.
- Parameters:
queries (iterable of DigitalSequence) – The query sequences to scan with the database. Passing a single query is supported.
profiles (iterable of HMM, Profile or OptimizedProfile) – A database of profiles to query. If you plan on using the same targets several times, consider converting them into OptimizedProfile and storing them into an OptimizedProfileBlock ahead of time. If an HMMPressedFile is given, profiles will be loaded iteratively from disk rather than prefetched.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.
callback (callable) – A callback that is called every time a query is processed with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.
background (pyhmmer.plan7.Background, optional) – A background object to use for configuring the profiles. If None is given, a default one is created.
- Yields:
TopHits – An object reporting top hits for each query, in the same order the queries were passed in the input.
- Raises:
AlphabetMismatch – When any of the query sequences and the profiles do not share the same alphabet.
Note
Any additional keyword arguments passed to the hmmscan function will be passed transparently to the Pipeline to be created in each worker thread.
Hint
If reading the profiles from a pressed HMM database, make sure to use the HMMFile.optimized_profiles method so that profiles are read iteratively from the file during the scan loop:
>>> with HMMFile("tests/data/hmms/db/t2pks.hmm") as hmm_file:
...     targets = hmm_file.optimized_profiles()
...     all_hits = list(hmmscan(proteins, targets, E=1e-10))
>>> sum(len(hits) for hits in all_hits)
26
Otherwise, passing hmm_file as the profiles argument of hmmscan would cause the entire HMM file to be loaded in memory into an OptimizedProfileBlock.
New in version 0.7.0.
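The callback parameter receives the processed query and the total number of queries, which is enough to build a simple progress report. The following is a minimal sketch: the factory name make_progress_callback is hypothetical, and because the text above does not state whether the callback may be invoked concurrently, the counter is guarded with a lock as a precaution.

```python
import sys
import threading

def make_progress_callback(stream=sys.stderr):
    """Build a callback(query, total) that writes 'processed N/total queries'."""
    lock = threading.Lock()
    done = 0

    def callback(query, total):
        nonlocal done
        with lock:
            # count each processed query and report progress
            done += 1
            stream.write(f"processed {done}/{total} queries\n")

    return callback

# Usage sketch (requires PyHMMER; `proteins` and `targets` as in the hint above):
#   all_hits = list(hmmscan(proteins, targets, callback=make_progress_callback()))
```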
phmmer
- pyhmmer.hmmer.phmmer(queries, sequences, cpus=0, callback=None, builder=None, **options)
Search protein sequences against a sequence database.
- Parameters:
queries (iterable of DigitalSequence or DigitalMSA) – The query sequences to search for in the sequence database. Passing a single object is supported.
sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. If a SequenceFile is given, sequences will be loaded iteratively from disk rather than prefetched.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.
callback (callable) – A callback that is called every time a query is processed with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.
builder (Builder, optional) – A builder to configure how the queries are converted to HMMs. Passing None will create a default instance.
- Yields:
TopHits – A top hits instance for each query, in the same order the queries were passed in the input.
- Raises:
AlphabetMismatch – When any of the query sequences, the profile, or the optional builder do not share the same alphabet.
Note
Any additional keyword arguments passed to the phmmer function will be passed transparently to the Pipeline to be created in each worker thread.
New in version 0.2.0.
Changed in version 0.3.0: Allow using DigitalMSA queries.
Changed in version 0.7.0: Queries may now be an iterable of different types, or a single object.
nhmmer
- pyhmmer.hmmer.nhmmer(queries, sequences, cpus=0, callback=None, builder=None, **options)
Search nucleotide sequences against a sequence database.
- Parameters:
queries (iterable of DigitalSequence, DigitalMSA, HMM) – The query sequences or profiles to search for in the sequence database. Passing a single object is supported.
sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. If a SequenceFile is given, sequences will be loaded iteratively from disk rather than prefetched.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.
callback (callable) – A callback that is called every time a query is processed with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.
builder (Builder, optional) – A builder to configure how the queries are converted to HMMs. Passing None will create a default instance.
- Yields:
TopHits – A top hits instance for each query, in the same order the queries were passed in the input.
Note
Any additional keyword arguments passed to the nhmmer function will be passed to the LongTargetsPipeline created in each worker thread. The strand argument can be used to restrict the search to the direct or reverse strand.
Caution
This function is not just phmmer for nucleotide sequences; it actually uses a LongTargetsPipeline internally instead of processing each target sequence in its entirety when searching for hits. This avoids hitting the maximum target size that can be used (100,000 residues), which may be a problem for some larger genomes.
New in version 0.3.0.
Changed in version 0.4.9: Allow using Profile and OptimizedProfile queries.
Changed in version 0.7.0: Queries may now be an iterable of different types, or a single object.
hmmpress
- pyhmmer.hmmer.hmmpress(hmms, output)
Press several HMMs into a database.
Calling this function will create 4 files at the given location: {output}.h3p (containing the optimized profiles), {output}.h3m (containing the binary HMMs), {output}.h3f (containing the MSV parameters), and {output}.h3i (the SSI index mapping the previous files).
- Parameters:
hmms (iterable of HMM) – The HMMs to be pressed together in the file.
output (str or os.PathLike) – The path to an output location where to write the different files.
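The four file names listed above follow directly from the output argument, so they can be computed ahead of time, for example to check for leftovers from a previous run. This is a small sketch; the helper name pressed_paths is hypothetical and not part of PyHMMER.

```python
import os

def pressed_paths(output):
    """List the four paths described above for a given output location."""
    base = os.fspath(output)  # accepts str or os.PathLike
    return [f"{base}.{ext}" for ext in ("h3p", "h3m", "h3f", "h3i")]

print(pressed_paths("db/t2pks"))
# ['db/t2pks.h3p', 'db/t2pks.h3m', 'db/t2pks.h3f', 'db/t2pks.h3i']
```

A caller could, for instance, test each path with os.path.exists before pressing a new database at the same location.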
hmmalign
- pyhmmer.hmmer.hmmalign(hmm, sequences, trim=False, digitize=False, all_consensus_cols=True)
Align several sequences to a reference HMM, and return the MSA.
- Parameters:
hmm (HMM) – The reference HMM to use for the alignment.
sequences (iterable of DigitalSequence) – The sequences to align to the HMM. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly.
trim (bool) – Trim off any residues that get assigned to flanking \(N\) and \(C\) states (in profile traces) or \(I_0\) and \(I_m\) (in core traces).
digitize (bool) – If set to True, returns a DigitalMSA instead of a TextMSA.
all_consensus_cols (bool) – Force a column to be created for every consensus column in the model, even if it means having only gap characters in a column.
- Returns:
MSA – A multiple sequence alignment containing the aligned sequences, either a TextMSA or a DigitalMSA depending on the value of the digitize argument.
See also
The TraceAligner class, which lets you inspect the intermediate tracebacks obtained for each alignment before building an MSA.
New in version 0.4.7.
jackhmmer
- pyhmmer.hmmer.jackhmmer(queries: DigitalSequence | HMM | Profile | OptimizedProfile | Iterable[DigitalSequence | HMM | Profile | OptimizedProfile], sequences: Iterable[DigitalSequence], *, max_iterations: int | None = 5, select_hits: Callable[[TopHits], None] | None = None, checkpoints: Literal[True], cpus: int = 0, callback: Callable[[DigitalSequence | HMM | Profile | OptimizedProfile, int], None] | None = None, builder: Builder | None = None, **options) -> Iterator[Iterable[IterationResult]]
- pyhmmer.hmmer.jackhmmer(queries: DigitalSequence | HMM | Profile | OptimizedProfile | Iterable[DigitalSequence | HMM | Profile | OptimizedProfile], sequences: Iterable[DigitalSequence], *, max_iterations: int | None = 5, select_hits: Callable[[TopHits], None] | None = None, checkpoints: Literal[False], cpus: int = 0, callback: Callable[[DigitalSequence | HMM | Profile | OptimizedProfile, int], None] | None = None, builder: Builder | None = None, **options) -> Iterator[IterationResult]
- pyhmmer.hmmer.jackhmmer(queries: DigitalSequence | HMM | Profile | OptimizedProfile | Iterable[DigitalSequence | HMM | Profile | OptimizedProfile], sequences: Iterable[DigitalSequence], *, max_iterations: int | None = 5, select_hits: Callable[[TopHits], None] | None = None, checkpoints: bool = False, cpus: int = 0, callback: Callable[[DigitalSequence | HMM | Profile | OptimizedProfile, int], None] | None = None, builder: Builder | None = None, **options) -> Iterator[IterationResult] | Iterator[Iterable[IterationResult]]
Search protein sequences against a sequence database.
- Parameters:
queries (iterable of DigitalSequence) – The query sequences to search for in the sequence database. Passing a single sequence object is supported.
sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. jackhmmer does not support passing a SequenceFile at the moment.
max_iterations (int) – The maximum number of iterations for the search. Hits will be returned early if the search converges.
select_hits (callable, optional) – A function or callable object for manually selecting hits during each iteration. It should take a single TopHits argument and change the inclusion of individual hits with the include and drop methods of Hit objects.
checkpoints (bool) – A logical flag to return the results at each iteration ‘checkpoint’. If True, then an iterable of up to max_iterations IterationResult will be returned, rather than just the final iteration. This is similar to the --chkhmm and --chkali flags from HMMER3’s jackhmmer interface.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.
callback (callable) – A callback that is called every time a query is processed with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.
builder (Builder, optional) – A builder to configure how the queries are converted to HMMs. Passing None will create a default instance.
- Yields:
IterationResult – An iteration result instance for each query, in the same order the queries were passed in the input. If the checkpoints option is True, all iterations will be returned instead of the last one.
- Raises:
AlphabetMismatch – When any of the query sequences, the profile, or the optional builder do not share the same alphabet.
Note
Any additional keyword arguments passed to the jackhmmer function will be passed transparently to the Pipeline to be created in each worker thread.
Caution
Default values used for jackhmmer do not correspond to the default parameters used for creating a pipeline in the other cases. If no parameter value is given as a keyword argument, jackhmmer will create the pipeline with incE=0.001 and incdomE=0.001, where a default Pipeline would use incE=0.01 and incdomE=0.01.
New in version 0.8.0.
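The select_hits parameter expects a callable that receives a TopHits object and calls include or drop on individual hits. A minimal sketch of such a callable is given below. The factory name make_score_selector and the idea of selecting on a bitscore threshold are illustrative assumptions; the sketch also assumes that iterating over a TopHits yields Hit objects exposing a score attribute.

```python
def make_score_selector(min_score: float):
    """Build a select_hits callable that keeps hits scoring at least min_score."""

    def select_hits(top_hits) -> None:
        # assumption: iterating over TopHits yields Hit objects
        for hit in top_hits:
            if hit.score >= min_score:
                hit.include()  # keep this hit for the next iteration
            else:
                hit.drop()     # exclude this hit from the next iteration

    return select_hits

# Usage sketch (requires PyHMMER; `query` and `proteins` are placeholders):
#   results = jackhmmer(query, proteins, select_hits=make_score_selector(50.0))
```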