Profile Searches
- pyhmmer.hmmer.hmmsearch(queries, sequences, cpus=0, callback=None, **options)
Search HMM profiles against a sequence database.
In HMMER many-to-many comparisons, a search is the operation of querying a database of sequences with profile HMMs.
The hmmsearch function offers two ways of managing the database, selected based on the type of the sequences argument. If sequences is a SequenceFile object, hmmsearch will reopen the file in each thread and load targets iteratively to scan with the query. Otherwise, it will pre-fetch the target sequences into a DigitalSequenceBlock collection and share them across threads without copying. Pre-fetching gives much higher performance at the cost of extra startup time and much higher memory consumption. You may want to check how much memory is available (for instance with psutil.virtual_memory) before trying to load a whole sequence database, but pre-fetching is strongly recommended whenever possible.
- Parameters:
queries (iterable of HMM, Profile or OptimizedProfile) – The query HMMs or profiles to search for in the database. Note that passing a single object is supported.
sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. If a SequenceFile is given, sequences will be loaded iteratively from disk rather than prefetched.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.
callback (callable) – A callback that is called every time a query is processed, with two arguments: the query and the total number of queries. This can be used to display progress in a UI.
backend (str) – The parallel backend used to execute the workers. Supports threading for thread-based parallelism, or multiprocessing for process-based parallelism.
- Yields:
TopHits – An object reporting top hits for each query, in the same order the queries were passed in the input.
- Raises:
AlphabetMismatch – When any of the query HMMs and the sequences do not share the same alphabet.
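As an illustration of the callback parameter described above, here is a minimal sketch of a progress reporter. The ProgressPrinter class and its write argument are hypothetical names that are not part of pyhmmer; the only behaviour assumed from the documentation is that the callback receives the query and the total number of queries.

```python
import sys


class ProgressPrinter:
    """Hypothetical progress reporter usable as a `callback` argument."""

    def __init__(self, write=None):
        # `write` is any callable taking a string; defaults to stderr.
        self.write = write if write is not None else sys.stderr.write
        self.done = 0  # number of queries processed so far

    def __call__(self, query, total):
        # Called once per processed query with the query object and
        # the total number of queries.
        self.done += 1
        name = getattr(query, "name", None)
        if isinstance(name, bytes):
            name = name.decode()
        self.write(f"[{self.done}/{total}] {name}\n")


progress = ProgressPrinter()
# Possible usage, assuming `hmms` and `proteins` were loaded beforehand:
# for hits in pyhmmer.hmmer.hmmsearch(hmms, proteins, callback=progress):
#     ...
```

Keeping the counter inside the callable avoids global state, and the same object could be passed to hmmscan since it documents the same callback arguments.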
Note
Any additional arguments passed to the hmmsearch function will be passed transparently to the Pipeline to be created. For instance, to run hmmsearch with a bitscore cutoff of 5 instead of the default E-value cutoff, use:
>>> hits = next(hmmsearch(thioesterase, proteins, T=5))
>>> hits[0].score
8.601...
Since version 0.11.0, mypy should be able to detect which keywords can be passed to hmmsearch using a TypedDict annotation.
Added in version 0.1.0.
Changed in version 0.4.9: Allow using Profile and OptimizedProfile queries.
Changed in version 0.7.0: Queries may now be an iterable of different types, or a single object.
- pyhmmer.hmmer.hmmscan(queries, profiles, cpus=0, callback=None, background=None, **options)
Scan query sequences against a profile database.
In HMMER many-to-many comparisons, a scan is the operation of querying a database of profile HMMs with sequences. It is necessarily slower than a search because reconfiguring profiles between queries has additional overhead, so a search is recommended if the order of the comparisons is not important.
The hmmscan function offers two ways of managing the database, selected based on the type of the profiles argument. If profiles is an HMMPressedFile object, hmmscan will reopen the file in each thread and load profiles iteratively to scan with the query. Otherwise, it will pre-fetch the optimized profiles into an OptimizedProfileBlock collection and share them across queries. Pre-fetching gives much higher performance at the cost of extra startup time and much higher memory consumption. You may want to check how much memory is available (for instance with psutil.virtual_memory) before trying to load a whole pHMM database.
- Parameters:
queries (iterable of DigitalSequence) – The query sequences to scan with the database. Passing a single query is supported.
profiles (iterable of HMM, Profile or OptimizedProfile) – A database of profiles to query. If you plan on using the same targets several times, consider converting them into OptimizedProfile and storing them into an OptimizedProfileBlock ahead of time. If an HMMPressedFile is given, profiles will be loaded iteratively from disk rather than prefetched.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.
callback (callable) – A callback that is called every time a query is processed, with two arguments: the query and the total number of queries. This can be used to display progress in a UI.
background (pyhmmer.plan7.Background, optional) – A background object to use for configuring the profiles. If None is given, a default one is created.
backend (str) – The parallel backend used to execute the workers. Supports threading for thread-based parallelism, or multiprocessing for process-based parallelism.
- Yields:
TopHits – An object reporting top hits for each query, in the same order the queries were passed in the input.
- Raises:
AlphabetMismatch – When any of the query sequences and the profiles do not share the same alphabet.
Note
Any additional keyword arguments passed to the hmmscan function will be passed transparently to the Pipeline to be created in each worker thread.
Hint
If reading the profiles from a pressed HMM database, make sure to use the HMMFile.optimized_profiles method so that profiles are read iteratively from the file during the scan loop:
>>> with HMMFile("tests/data/hmms/db/PF02826.hmm") as hmm_file:
...     targets = hmm_file.optimized_profiles()
...     all_hits = list(hmmscan(proteins, targets, E=1e-10))
>>> sum(len(hits) for hits in all_hits)
6
Passing hmm_file itself as the profiles argument of hmmscan would instead cause the entire HMM file to be loaded in memory into an OptimizedProfileBlock.
Added in version 0.7.0.
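Since a scan is described above as slower than a search, one option when only the per-sequence grouping matters is to run a search and regroup its results afterwards. The sketch below shows only the regrouping step on plain tuples; group_by_sequence and the record layout (profile name, sequence name, score) are hypothetical, and the records are assumed to have been extracted from the TopHits objects beforehand.

```python
from collections import defaultdict


def group_by_sequence(search_hits):
    """Regroup (profile, sequence, score) records per sequence.

    Returns a dict mapping each sequence name to its list of
    (profile, score) pairs, sorted by decreasing score.
    """
    per_sequence = defaultdict(list)
    for profile, sequence, score in search_hits:
        per_sequence[sequence].append((profile, score))
    for hits in per_sequence.values():
        hits.sort(key=lambda item: item[1], reverse=True)
    return dict(per_sequence)


# Made-up records, for illustration only (not real search results).
records = [
    ("PF_A", "seq1", 10.0),
    ("PF_B", "seq1", 25.0),
    ("PF_A", "seq2", 7.5),
]
grouped = group_by_sequence(records)
```

Bit scores can be regrouped this way, but E-values depend on the size of the target database, so they may differ from those an actual hmmscan run would report.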