Profile Searches
- pyhmmer.hmmer.hmmsearch(queries, sequences, cpus=0, callback=None, backend='threading', parallel=None, **options)
Search HMM profiles against a sequence database.
In HMMER many-to-many comparisons, a search is the operation of querying a database of sequences with profile HMMs.
The hmmsearch function offers two ways of managing the database, selected based on the type of the sequences argument. If sequences is a SequenceFile object, hmmsearch will reopen the file in each thread and load targets iteratively to scan with the query. Otherwise, it will pre-fetch the target sequences into a DigitalSequenceBlock collection and share them across threads without copying. Pre-fetching gives much higher performance at the cost of extra startup time and much higher memory consumption. You may want to check how much memory is available (for instance with psutil.virtual_memory) before trying to load a whole sequence database, but pre-fetching is strongly recommended whenever possible.
- Parameters:
queries (iterable of HMM, Profile or OptimizedProfile) – The query HMMs or profiles to search for in the database. Passing a single object is also supported.
sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. If a SequenceFile is given, sequences will be loaded iteratively from disk rather than prefetched.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any other positive number.
callback (callable) – A callback that is called every time a query is processed, with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.
backend (str) – The parallel backend to use for executing workers. Supports "threading" to use thread-based parallelism, or "multiprocessing" to use process-based parallelism.
parallel (str) – The parallel strategy to use. Supports "queries" to run queries in parallel, or "targets" to parallelize on targets while running one query at a time. If None is given, "queries" is used by default unless a single query or a small number of queries is detected. Note that parallelization on "targets" does not work with SequenceFile targets.
- Yields:
TopHits – An object reporting top hits for each query, in the same order the queries were passed in the input.
- Raises:
AlphabetMismatch – When any of the query HMMs and the sequences do not share the same alphabet.
RuntimeError – When attempting to use the "targets" parallel strategy with targets from a SequenceFile.
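The rules for the parallel argument and the associated RuntimeError can be sketched as a small decision function. This is only an illustration of the documented behaviour, not pyhmmer's implementation: the `choose_parallel` helper and its `small` cutoff are hypothetical, and the text does not state what counts as "a small number of queries".

```python
from typing import Optional


def choose_parallel(
    n_queries: int,
    targets_from_file: bool,
    parallel: Optional[str] = None,
    small: int = 2,
) -> str:
    """Pick a parallel strategy following the documented rules.

    `small` is a made-up cutoff for "a small number of queries"; the
    value pyhmmer actually uses is not stated in the text.
    """
    if parallel not in (None, "queries", "targets"):
        raise ValueError(f"unknown parallel strategy: {parallel!r}")
    if parallel == "targets" and targets_from_file:
        # Mirrors the documented RuntimeError for SequenceFile targets.
        raise RuntimeError("cannot parallelize on targets read from a file")
    if parallel is not None:
        return parallel
    # Default: parallelize on queries unless there are only a few of them
    # and the targets are held in memory.
    if n_queries <= small and not targets_from_file:
        return "targets"
    return "queries"


print(choose_parallel(100, targets_from_file=False))  # many queries
print(choose_parallel(1, targets_from_file=False))    # single query
print(choose_parallel(1, targets_from_file=True))     # single query, targets on disk
```

Under these assumptions the three calls select "queries", "targets" and "queries" respectively, while explicitly requesting "targets" with file-backed targets raises RuntimeError.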
Note
Any additional arguments passed to the hmmsearch function will be passed transparently to the Pipeline to be created. For instance, to run a hmmsearch using a bitscore cutoff of 5 instead of the default E-value cutoff, use:
>>> hits = next(hmmsearch(thioesterase, proteins, T=5))
>>> hits[0].score
8.601...
Since version 0.11.0, mypy should be able to detect which keywords can be passed to hmmsearch using a TypedDict annotation.
Added in version 0.1.0.
Added in version 0.11.1: parallel argument to select parallelization strategy.
Changed in version 0.4.9: Allow using Profile and OptimizedProfile queries.
Changed in version 0.7.0: Queries may now be an iterable of different types, or a single object.
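The trade-off between pre-fetching and streaming the sequence database can be guarded with a memory check before loading. The sketch below is a minimal illustration under stated assumptions: `should_prefetch`, `load_targets` and the `expansion` safety factor are hypothetical names and values, and the pyhmmer calls (`SequenceFile(path, digital=True)` and `read_block()`) reflect recent pyhmmer releases and may differ in other versions.

```python
import os


def should_prefetch(file_size: int, available: int, expansion: float = 2.0) -> bool:
    """Return True if a database of `file_size` bytes likely fits in memory.

    `expansion` is a made-up safety factor (in-memory size / on-disk size);
    it is an assumption for illustration, not a value measured on pyhmmer.
    """
    if file_size < 0 or available < 0:
        raise ValueError("sizes must be non-negative")
    return file_size * expansion <= available


def load_targets(path: str, available: int):
    """Return a pre-fetched block when memory allows, else an open SequenceFile."""
    import pyhmmer  # imported lazily so the helper above has no dependency

    seq_file = pyhmmer.easel.SequenceFile(path, digital=True)
    if should_prefetch(os.path.getsize(path), available):
        with seq_file:
            # Load every sequence at once so it can be shared across threads.
            return seq_file.read_block()
    # Not enough memory: hand back the open file so targets are streamed.
    # The caller is responsible for closing it.
    return seq_file


print(should_prefetch(1_000, 10_000))
print(should_prefetch(6_000, 10_000))
```

The `available` value would typically come from `psutil.virtual_memory().available`, and the object returned by `load_targets` would be passed as the sequences argument of hmmsearch.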
- pyhmmer.hmmer.hmmscan(queries, profiles, cpus=0, callback=None, background=None, backend='threading', **options)
Scan query sequences against a profile database.
In HMMER many-to-many comparisons, a scan is the operation of querying a database of profile HMMs with sequences. It is necessarily slower than a search because reconfiguring profiles between queries has additional overhead, so it is recommended to use a search if the order of the comparisons is not important.
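When a search is used in place of a scan, results arrive grouped by profile rather than by sequence. A minimal sketch of regrouping them into a per-sequence view follows; it uses plain tuples with made-up profile and sequence names as stand-ins for real hits, and the `group_by_sequence` helper is hypothetical. It only reorders results; significance values computed by a search and by a scan are not guaranteed to match.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (profile name, sequence name, bit score): stand-ins for search hits
SearchHit = Tuple[str, str, float]


def group_by_sequence(hits: Iterable[SearchHit]) -> Dict[str, List[Tuple[str, float]]]:
    """Regroup per-profile search hits into a per-sequence view, best score first."""
    grouped: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for profile, sequence, score in hits:
        grouped[sequence].append((profile, score))
    for matches in grouped.values():
        matches.sort(key=lambda item: item[1], reverse=True)
    return dict(grouped)


search_hits = [
    ("PF00001", "seq1", 42.0),
    ("PF00001", "seq2", 12.5),
    ("PF00002", "seq1", 87.3),
]
per_sequence = group_by_sequence(search_hits)
print(per_sequence["seq1"])
```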
The hmmscan function offers two ways of managing the database, selected based on the type of the profiles argument. If profiles is an HMMPressedFile object, hmmscan will reopen the file in each thread and load profiles iteratively to scan with the query. Otherwise, it will pre-fetch the optimized profiles into an OptimizedProfileBlock collection and share them across queries. Pre-fetching gives much higher performance at the cost of extra startup time and much higher memory consumption. You may want to check how much memory is available (for instance with psutil.virtual_memory) before trying to load a whole pHMM database.
- Parameters:
queries (iterable of DigitalSequence) – The query sequences to scan with the database. Passing a single query is supported.
profiles (iterable of HMM, Profile or OptimizedProfile) – A database of profiles to query. If you plan on using the same targets several times, consider converting them into OptimizedProfile and storing them into an OptimizedProfileBlock ahead of time. If an HMMPressedFile is given, profiles will be loaded iteratively from disk rather than prefetched.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any other positive number.
callback (callable) – A callback that is called every time a query is processed, with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.
background (pyhmmer.plan7.Background, optional) – A background object to use for configuring the profiles. If None is given, a default one is created.
backend (str) – The parallel backend to use for executing workers. Supports "threading" to use thread-based parallelism, or "multiprocessing" to use process-based parallelism.
- Yields:
TopHits – An object reporting top hits for each query, in the same order the queries were passed in the input.
- Raises:
AlphabetMismatch – When any of the query sequences and the profiles do not share the same alphabet.
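The callback parameter receives the query that was just processed and the total number of queries, but no running index. A minimal sketch of a progress reporter built on that contract is shown here; `make_progress_callback` and `messages` are hypothetical names, the lock is an assumption in case the callback is invoked from worker threads, and the demo uses stand-in strings instead of real queries so it runs without pyhmmer.

```python
import threading
from typing import Callable, List


def make_progress_callback(log: List[str]) -> Callable[[object, int], None]:
    """Build a callback matching the documented (query, total) contract.

    Hypothetical helper: it keeps its own counter, because the callback
    only receives the query and the total, not a running index.
    """
    done = 0
    lock = threading.Lock()  # assumption: calls may come from worker threads

    def on_query_done(query: object, total: int) -> None:
        nonlocal done
        with lock:
            done += 1
            log.append(f"{done}/{total} queries processed")

    return on_query_done


# Demo with stand-in objects instead of real query sequences.
messages: List[str] = []
callback = make_progress_callback(messages)
for fake_query in ("seq_a", "seq_b", "seq_c"):
    callback(fake_query, 3)
print(messages[-1])
```

With pyhmmer installed, the same object would be passed as `callback=callback` to hmmscan or hmmsearch. An in-memory counter like this one is only expected to work with the threading backend.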
Note
Any additional keyword arguments passed to the hmmscan function will be passed transparently to the Pipeline to be created in each worker thread.
Hint
If reading the profiles from a pressed HMM database, make sure to use the HMMFile.optimized_profiles method so that profiles are read iteratively from the file during the scan loop:
>>> with HMMFile("tests/data/hmms/db/PF02826.hmm") as hmm_file:
...     targets = hmm_file.optimized_profiles()
...     all_hits = list(hmmscan(proteins, targets, E=1e-10, cpus=1))
>>> sum(len(hits) for hits in all_hits)
6
Otherwise, passing hmm_file as the profiles argument of hmmscan would cause the entire HMM file to be loaded in memory into an OptimizedProfileBlock.
Added in version 0.7.0.