Profile Searches

pyhmmer.hmmer.hmmsearch(queries, sequences, cpus=0, callback=None, backend='threading', parallel=None, **options)

Search HMM profiles against a sequence database.

In HMMER many-to-many comparisons, a search is the operation of querying a database of sequences with profile HMMs.

The hmmsearch function offers two ways of managing the database, selected based on the type of the sequences argument. If sequences is a SequenceFile object, hmmsearch will reopen the file in each thread and load targets iteratively to scan with the query. Otherwise, it will pre-fetch the target sequences into a DigitalSequenceBlock collection and share them across threads without copying. Pre-fetching gives much higher performance at the cost of extra startup time and much higher memory consumption. You may want to check how much memory is available (for instance with psutil.virtual_memory) before trying to load a whole sequence database, but pre-fetching is strongly recommended whenever memory allows.
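The choice between pre-fetching and iterative loading can be made explicit before calling hmmsearch. The following is a minimal sketch, not part of pyhmmer: the helper names can_prefetch and open_targets and the overhead factor of 2.0 are assumptions chosen for illustration, and the on-disk file size is only a rough proxy for the in-memory size of a DigitalSequenceBlock.

```python
import os


def can_prefetch(file_size, available, overhead=2.0):
    """Return True if a file of `file_size` bytes is assumed to fit in memory.

    `overhead` is a hypothetical safety factor, not a value taken from pyhmmer.
    """
    return file_size * overhead <= available


def open_targets(path, overhead=2.0):
    """Return a DigitalSequenceBlock if memory allows, else an open SequenceFile."""
    # Imported lazily so the size check above stays usable on its own.
    import psutil
    import pyhmmer

    available = psutil.virtual_memory().available
    seq_file = pyhmmer.easel.SequenceFile(path, digital=True)
    if can_prefetch(os.path.getsize(path), available, overhead):
        with seq_file:
            # Pre-fetch every sequence into memory, then close the file.
            return seq_file.read_block()
    # Too large to pre-fetch: the caller must close this file after the search.
    return seq_file
```

Either return value can be passed as the sequences argument; when a SequenceFile is returned, it should be closed once the search has finished.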

Parameters:

queries (iterable of HMM, Profile or OptimizedProfile) – The query HMMs or profiles to search for in the database. Note that passing a single object is supported.
sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. If a SequenceFile is given, sequences will be loaded iteratively from disk rather than prefetched.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.
callback (callable) – A callback that is called every time a query is processed, with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.
backend (str) – The parallel backend to use for workers to be executed. Supports threading to use thread-based parallelism, or multiprocessing to use process-based parallelism.
parallel (str) – The parallel strategy to use. Supports queries to run queries in parallel, or targets to parallelize on targets while running one query at a time. If None is given, use queries by default unless a single query or a small number of queries is detected. Note that parallelization on targets does not work with SequenceFile targets.
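As an illustration of the callback parameter, here is a minimal sketch of a progress tracker. The name make_progress_tracker is hypothetical, and the lock is a precaution based on the assumption that the callback may be invoked from worker threads.

```python
import threading


def make_progress_tracker():
    """Build a callback that records (processed so far, total) pairs."""
    lock = threading.Lock()
    progress = []

    def callback(query, total):
        # `query` is the object that was just processed; `total` is the
        # number of queries, as described in the parameter list above.
        with lock:
            progress.append((len(progress) + 1, total))

    return callback, progress


# Intended use (requires pyhmmer and real inputs, so it is left as a comment):
#   callback, progress = make_progress_tracker()
#   for hits in hmmsearch(queries, sequences, callback=callback):
#       ...
```

Keeping the recorded state in a closure avoids global variables and lets several searches be tracked independently.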

Yields:

TopHits – An object reporting top hits for each query, in the same order the queries were passed in the input.

Raises:

AlphabetMismatch – When any of the query HMMs and the sequences do not share the same alphabet.
RuntimeError – When attempting to use targets parallel strategy with targets from a SequenceFile.
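Since the targets strategy raises a RuntimeError with SequenceFile targets, the strategy can be picked explicitly before calling hmmsearch. The helper below is a hypothetical heuristic written for illustration; it is not the detection logic pyhmmer applies when parallel is None, and the threshold few is an assumption.

```python
def choose_parallel(n_queries, targets_on_disk, few=1):
    """Pick a value for the `parallel` argument of hmmsearch.

    `few` is an illustrative threshold for "a small number of queries".
    """
    if targets_on_disk:
        # Targets read from a SequenceFile cannot be parallelized on.
        return "queries"
    # With in-memory targets and few queries, split the work across targets.
    return "targets" if n_queries <= few else "queries"
```

The returned string can then be passed as parallel=choose_parallel(...), for example with targets_on_disk set from isinstance(sequences, SequenceFile).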

Note

Any additional arguments passed to the hmmsearch function will be passed transparently to the Pipeline to be created. For instance, to run an hmmsearch using a bitscore cutoff of 5 instead of the default E-value cutoff, use:

>>> hits = next(hmmsearch(thioesterase, proteins, T=5))
>>> hits[0].score
8.601...

Since version 0.11.0, mypy should be able to detect which keywords can be passed to hmmsearch using a TypedDict annotation.

Added in version 0.1.0.

Added in version 0.11.1: parallel argument to select parallelization strategy.

Changed in version 0.4.9: Allow using Profile and OptimizedProfile queries.

Changed in version 0.7.0: Queries may now be an iterable of different types, or a single object.

pyhmmer.hmmer.hmmscan(queries, profiles, cpus=0, callback=None, background=None, backend='threading', **options)

Scan query sequences against a profile database.

In HMMER many-to-many comparisons, a scan is the operation of querying a database of profile HMMs with sequences. It is necessarily slower than a search because reconfiguring profiles between queries has additional overhead, so it is recommended to use a search if the order of the comparisons is not important.

The hmmscan function offers two ways of managing the database that will be selected based on the type of the profiles argument. If profiles is an HMMPressedFile object, hmmscan will reopen the file in each thread, and load profiles iteratively to scan with the query. Otherwise, it will pre-fetch the optimized profiles into an OptimizedProfileBlock collection, and share them across queries. The pre-fetching gives much higher performance at the cost of extra startup time and much higher memory consumption. You may want to check how much memory is available (for instance with psutil.virtual_memory) before trying to load a whole pHMM database.

Parameters:

queries (iterable of DigitalSequence) – The query sequences to scan with the database. Passing a single query is supported.
profiles (iterable of HMM, Profile or OptimizedProfile) – A database of profiles to query. If you plan on using the same targets several times, consider converting them into OptimizedProfile and storing them into an OptimizedProfileBlock ahead of time. If an HMMPressedFile is given, profiles will be loaded iteratively from disk rather than prefetched.
cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.
callback (callable) – A callback that is called every time a query is processed, with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.
background (pyhmmer.plan7.Background, optional) – A background object to use for configuring the profiles. If None is given, create a default one.
backend (str) – The parallel backend to use for workers to be executed. Supports threading to use thread-based parallelism, or multiprocessing to use process-based parallelism.

Yields:

TopHits – An object reporting top hits for each query, in the same order the queries were passed in the input.
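Because results are yielded in the same order as the queries, each result can be matched back to its query with zip. The sketch below assumes that queries is a re-iterable collection (such as a list) of objects with a name attribute, and that len() works on each result; the helper name hit_counts is hypothetical.

```python
def hit_counts(queries, results):
    """Map each query name to the number of hits reported for it.

    `results` is expected to yield one item per query, in query order,
    as the iterator returned by hmmscan does.
    """
    counts = {}
    for query, hits in zip(queries, results):
        counts[query.name] = len(hits)
    return counts


# Intended use (requires pyhmmer and real inputs, so it is left as a comment):
#   counts = hit_counts(queries, hmmscan(queries, profiles))
```

Passing a one-shot iterator as queries would not work here, since the same collection is consumed both by the scan and by zip.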

Raises:

AlphabetMismatch – When any of the query sequences and the profiles do not share the same alphabet.

Note

Any additional keyword arguments passed to the hmmscan function will be passed transparently to the Pipeline to be created in each worker thread.

Hint

If reading the profiles from a pressed HMM database, make sure to use the HMMFile.optimized_profiles method so that profiles are read iteratively from the file during the scan loop:

>>> with HMMFile("tests/data/hmms/db/PF02826.hmm") as hmm_file:
...     targets = hmm_file.optimized_profiles()
...     all_hits = list(hmmscan(proteins, targets, E=1e-10, cpus=1))
>>> sum(len(hits) for hits in all_hits)
6

Otherwise, passing hmm_file as the profiles argument of hmmscan would cause the entire HMM file to be loaded in memory into an OptimizedProfileBlock.

Added in version 0.7.0.