HMMER

Reimplementation of HMMER binaries with the PyHMMER API.

Note

Functions of this module handle parallelization using threads to run searches in parallel for the different queries. If fewer queries are given than threads requested, the number of threads will be reduced to avoid spawning idle threads.
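As a rough illustration of the thread reduction described in this note, the sketch below caps the worker count at the number of queries. The helper name effective_threads is hypothetical and not part of PyHMMER, and os.cpu_count stands in for psutil.cpu_count so that the snippet only needs the standard library.

```python
import os


def effective_threads(cpus: int, n_queries: int) -> int:
    """Return how many worker threads are worth starting (hypothetical helper)."""
    if cpus <= 0:
        # 0 means "pick automatically"; fall back to 1 if the count is unknown.
        cpus = os.cpu_count() or 1
    # Never start more threads than there are queries to process.
    return max(1, min(cpus, n_queries))


print(effective_threads(8, 3))   # 3
print(effective_threads(2, 10))  # 2
```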

hmmsearch

pyhmmer.hmmer.hmmsearch(queries, sequences, cpus=0, callback=None, **options)

Search HMM profiles against a sequence database.

In HMMER many-to-many comparisons, a search is the operation of querying a database of sequences with profile HMMs.

The hmmsearch function offers two ways of managing the database that will be selected based on the type of the sequences argument. If sequences is a SequenceFile object, hmmsearch will reopen the file in each thread, and load targets iteratively to scan with the query. Otherwise, it will pre-fetch the target sequences into a DigitalSequenceBlock collection, and share them across threads without copy. The pre-fetching gives much higher performance at the cost of extra startup time and much higher memory consumption. You may want to check how much memory is available (for instance with psutil.virtual_memory) before trying to load a whole sequence database, but pre-fetching is strongly recommended whenever memory allows.
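The choice between the two modes could be sketched as follows. This is only an illustration under stated assumptions: the helper names should_prefetch and search are hypothetical, the on-disk file size is used as a crude proxy for the memory needed, and the overhead factor of 2 is an arbitrary safety margin rather than a documented value.

```python
import os


def should_prefetch(db_bytes: int, available_bytes: int, overhead: float = 2.0) -> bool:
    """Guess whether a sequence database fits in memory.

    `overhead` is an assumed safety factor, not a value documented by PyHMMER.
    """
    return db_bytes * overhead <= available_bytes


def search(hmm_path: str, fasta_path: str, cpus: int = 0) -> list:
    # Third-party imports are kept local so that should_prefetch stays usable
    # without psutil or pyhmmer installed.
    import psutil
    import pyhmmer

    available = psutil.virtual_memory().available
    with pyhmmer.plan7.HMMFile(hmm_path) as hmm_file:
        with pyhmmer.easel.SequenceFile(fasta_path, digital=True) as seq_file:
            if should_prefetch(os.path.getsize(fasta_path), available):
                # Enough memory: load every target into a DigitalSequenceBlock.
                targets = seq_file.read_block()
            else:
                # Low memory: let hmmsearch read targets from the file itself.
                targets = seq_file
            return list(pyhmmer.hmmer.hmmsearch(hmm_file, targets, cpus=cpus))
```

The list of TopHits is built inside the with blocks because the files must remain open while the search runs.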

Parameters:
  • queries (iterable of HMM, Profile or OptimizedProfile) – The query HMMs or profiles to search for in the database. Note that passing a single object is supported.

  • sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. If a SequenceFile is given, sequences will be loaded iteratively from disk rather than prefetched.

  • cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.

  • callback (callable) – A callback that is called every time a query is processed with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.
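A callback matching the two-argument signature described above might look like the following sketch. The Progress class is a hypothetical example, and the loop at the end only simulates the calls that a search function would make.

```python
class Progress:
    """Count processed queries and print a simple progress line."""

    def __init__(self):
        self.done = 0

    def __call__(self, query, total):
        # Called once per processed query with the query and the query count.
        self.done += 1
        print(f"{self.done}/{total} queries processed")


progress = Progress()
# Simulate three calls, as a search function would make them.
for query in ("q1", "q2", "q3"):
    progress(query, 3)
```

An instance could then be passed as callback=progress to any of the search functions of this module.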

Yields:

TopHits – An object reporting top hits for each query, in the same order the queries were passed in the input.

Raises:

AlphabetMismatch – When any of the query HMMs and the sequences do not share the same alphabet.

Note

Any additional arguments passed to the hmmsearch function will be passed transparently to the Pipeline to be created. For instance, to run an hmmsearch using a bitscore cutoff of 5 instead of the default E-value cutoff, use:

>>> hits = next(hmmsearch(thioesterase, proteins, T=5))
>>> hits[0].score
8.601...

Added in version 0.1.0.

Changed in version 0.4.9: Allow using Profile and OptimizedProfile queries.

Changed in version 0.7.0: Queries may now be an iterable of different types, or a single object.

hmmscan

pyhmmer.hmmer.hmmscan(queries, profiles, cpus=0, callback=None, background=None, **options)

Scan query sequences against a profile database.

In HMMER many-to-many comparisons, a scan is the operation of querying a database of profile HMMs with sequences. It is necessarily slower than a search because reconfiguring profiles between queries has additional overhead, so it’s recommended to use a search if the order of the comparisons is not important.

The hmmscan function offers two ways of managing the database that will be selected based on the type of the profiles argument. If profiles is an HMMPressedFile object, hmmscan will reopen the file in each thread, and load profiles iteratively to scan with the query. Otherwise, it will pre-fetch the optimized profiles into an OptimizedProfileBlock collection, and share them across queries. The pre-fetching gives much higher performance at the cost of extra startup time and much higher memory consumption. You may want to check how much memory is available (for instance with psutil.virtual_memory) before trying to load a whole pHMM database.

Parameters:
  • queries (iterable of DigitalSequence) – The query sequences to scan with the database. Passing a single query is supported.

  • profiles (iterable of HMM, Profile or OptimizedProfile) – A database of profiles to query. If you plan on using the same targets several times, consider converting them into OptimizedProfile and storing them into an OptimizedProfileBlock ahead of time. If a HMMPressedFile is given, profiles will be loaded iteratively from disk rather than prefetched.

  • cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.

  • callback (callable) – A callback that is called every time a query is processed with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.

  • background (pyhmmer.plan7.Background, optional) – A background object to use for configuring the profiles. If None given, create a default one.

Yields:

TopHits – An object reporting top hits for each query, in the same order the queries were passed in the input.

Raises:

AlphabetMismatch – When any of the query sequences and the profiles do not share the same alphabet.

Note

Any additional keyword arguments passed to the hmmscan function will be passed transparently to the Pipeline to be created in each worker thread.

Hint

If reading the profiles from a pressed HMM database, make sure to use the HMMFile.optimized_profiles method so that profiles are read iteratively from the file during the scan loop:

>>> with HMMFile("tests/data/hmms/db/PF02826.hmm") as hmm_file:
...     targets = hmm_file.optimized_profiles()
...     all_hits = list(hmmscan(proteins, targets, E=1e-10))
>>> sum(len(hits) for hits in all_hits)
6

Otherwise, passing hmm_file as the profiles argument of hmmscan would cause the entire HMM file to be loaded in memory into an OptimizedProfileBlock.

Added in version 0.7.0.

phmmer

pyhmmer.hmmer.phmmer(queries, sequences, cpus=0, callback=None, builder=None, **options)

Search protein sequences against a sequence database.

Parameters:
  • queries (iterable of DigitalSequence or DigitalMSA) – The query sequences to search for in the sequence database. Passing a single object is supported.

  • sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. If a SequenceFile is given, sequences will be loaded iteratively from disk rather than prefetched.

  • cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.

  • callback (callable) – A callback that is called every time a query is processed with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.

  • builder (Builder, optional) – A builder to configure how the queries are converted to HMMs. Passing None will create a default instance.

Yields:

TopHits – A top hits instance for each query, in the same order the queries were passed in the input.

Raises:

AlphabetMismatch – When any of the query sequences, the profile, or the optional builder do not share the same alphabet.

Note

Any additional keyword arguments passed to the phmmer function will be passed transparently to the Pipeline to be created in each worker thread.

Added in version 0.2.0.

Changed in version 0.3.0: Allow using DigitalMSA queries.

Changed in version 0.7.0: Queries may now be an iterable of different types, or a single object.

nhmmer

pyhmmer.hmmer.nhmmer(queries, sequences, cpus=0, callback=None, builder=None, **options)

Search nucleotide sequences against a sequence database.

Parameters:
  • queries (iterable of DigitalSequence, DigitalMSA, HMM) – The query sequences or profiles to search for in the sequence database. Passing a single object is supported.

  • sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. If a SequenceFile is given, sequences will be loaded iteratively from disk rather than prefetched.

  • cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.

  • callback (callable) – A callback that is called every time a query is processed with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.

  • builder (Builder, optional) – A builder to configure how the queries are converted to HMMs. Passing None will create a default instance.

Yields:

TopHits – A top hits instance for each query, in the same order the queries were passed in the input.

Note

Any additional keyword arguments passed to the nhmmer function will be passed to the LongTargetsPipeline created in each worker thread. The strand argument can be used to restrict the search to the direct or reverse strand.

Caution

This function is not just phmmer for nucleotide sequences; it actually uses a LongTargetsPipeline internally instead of processing each target sequence in its entirety when searching for hits. This avoids hitting the maximum target size that can be used (100,000 residues), which may be a problem for some larger genomes.

Added in version 0.3.0.

Changed in version 0.4.9: Allow using Profile and OptimizedProfile queries.

Changed in version 0.7.0: Queries may now be an iterable of different types, or a single object.

hmmpress

pyhmmer.hmmer.hmmpress(hmms, output)

Press several HMMs into a database.

Calling this function will create 4 files at the given location: {output}.h3p (containing the optimized profiles), {output}.h3m (containing the binary HMMs), {output}.h3f (containing the MSV parameters), and {output}.h3i (the SSI index mapping the previous files).

Parameters:
  • hmms (iterable of HMM) – The HMMs to be pressed together in the file.

  • output (str or os.PathLike) – The path to an output location where to write the different files.
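The file layout described above can be sketched with a small helper that derives the four expected paths from an output location. The names pressed_paths and is_pressed are hypothetical and not part of PyHMMER, and the path db/Pfam.hmm is only a placeholder.

```python
import os

# Extensions listed in the description of hmmpress.
PRESSED_EXTENSIONS = ("h3m", "h3i", "h3f", "h3p")


def pressed_paths(output):
    """Return the four paths expected for a pressed database at `output`."""
    base = os.fspath(output)
    return [f"{base}.{ext}" for ext in PRESSED_EXTENSIONS]


def is_pressed(output):
    """Check that every expected file exists on disk."""
    return all(os.path.exists(path) for path in pressed_paths(output))


print(pressed_paths("db/Pfam.hmm"))
# ['db/Pfam.hmm.h3m', 'db/Pfam.hmm.h3i', 'db/Pfam.hmm.h3f', 'db/Pfam.hmm.h3p']
```

Such a check may be useful to avoid pressing the same HMMs again when the files are already present.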

hmmalign

pyhmmer.hmmer.hmmalign(hmm, sequences, trim=False, digitize=False, all_consensus_cols=True)

Align several sequences to a reference HMM, and return the MSA.

Parameters:
  • hmm (HMM) – The reference HMM to use for the alignment.

  • sequences (iterable of DigitalSequence) – The sequences to align to the HMM. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly.

  • trim (bool) – Trim off any residues that get assigned to flanking \(N\) and \(C\) states (in profile traces) or \(I_0\) and \(I_m\) (in core traces).

  • digitize (bool) – If set to True, returns a DigitalMSA instead of a TextMSA.

  • all_consensus_cols (bool) – Force a column to be created for every consensus column in the model, even if it means having a column made only of gap characters.

Returns:

MSA – A multiple sequence alignment containing the aligned sequences, either a TextMSA or a DigitalMSA depending on the value of the digitize argument.

See also

The TraceAligner class, which lets you inspect the intermediate tracebacks obtained for each alignment before building an MSA.

Added in version 0.4.7.

jackhmmer

pyhmmer.hmmer.jackhmmer(queries: DigitalSequence | HMM | Profile | OptimizedProfile | Iterable[DigitalSequence | HMM | Profile | OptimizedProfile], sequences: Iterable[DigitalSequence], *, max_iterations: int | None = 5, select_hits: Callable[[TopHits], None] | None = None, checkpoints: Literal[True], cpus: int = 0, callback: Callable[[DigitalSequence | HMM | Profile | OptimizedProfile, int], None] | None = None, builder: Builder | None = None, **options) -> Iterator[Iterable[IterationResult]]
pyhmmer.hmmer.jackhmmer(queries: DigitalSequence | HMM | Profile | OptimizedProfile | Iterable[DigitalSequence | HMM | Profile | OptimizedProfile], sequences: Iterable[DigitalSequence], *, max_iterations: int | None = 5, select_hits: Callable[[TopHits], None] | None = None, checkpoints: Literal[False], cpus: int = 0, callback: Callable[[DigitalSequence | HMM | Profile | OptimizedProfile, int], None] | None = None, builder: Builder | None = None, **options) -> Iterator[IterationResult]
pyhmmer.hmmer.jackhmmer(queries: DigitalSequence | HMM | Profile | OptimizedProfile | Iterable[DigitalSequence | HMM | Profile | OptimizedProfile], sequences: Iterable[DigitalSequence], *, max_iterations: int | None = 5, select_hits: Callable[[TopHits], None] | None = None, checkpoints: bool = False, cpus: int = 0, callback: Callable[[DigitalSequence | HMM | Profile | OptimizedProfile, int], None] | None = None, builder: Builder | None = None, **options) -> Iterator[IterationResult] | Iterator[Iterable[IterationResult]]

Iteratively search protein sequences against a sequence database.

Parameters:
  • queries (iterable of DigitalSequence) – The query sequences to search for in the sequence database. Passing a single sequence object is supported.

  • sequences (iterable of DigitalSequence) – A database of sequences to query. If you plan on using the same sequences several times, consider storing them into a DigitalSequenceBlock directly. jackhmmer does not support passing a SequenceFile at the moment.

  • max_iterations (int) – The maximum number of iterations for the search. Hits will be returned early if the search converged.

  • select_hits (callable, optional) – A function or callable object for manually selecting hits during each iteration. It should take a single TopHits argument and change the inclusion of individual hits with the include and drop methods of Hit objects.

  • checkpoints (bool) – A logical flag to return the results at each iteration ‘checkpoint’. If True, then an iterable of up to max_iterations IterationResult will be returned, rather than just the final iteration. This is similar to the --chkhmm and --chkali flags from HMMER3’s jackhmmer interface.

  • cpus (int) – The number of threads to run in parallel. Pass 1 to run everything in the main thread, 0 to automatically select a suitable number (using psutil.cpu_count), or any positive number otherwise.

  • callback (callable) – A callback that is called every time a query is processed with two arguments: the query, and the total number of queries. This can be used to display progress in a UI.

  • builder (Builder, optional) – A builder to configure how the queries are converted to HMMs. Passing None will create a default instance.
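A select_hits callable following the description above could be sketched as below. The function keep_high_scoring and its 50-bit threshold are arbitrary choices for illustration, and FakeHit is a stand-in that only mimics the score attribute and the include and drop methods so that the snippet runs without PyHMMER.

```python
def keep_high_scoring(top_hits, min_score=50.0):
    """Include hits scoring at least `min_score` bits, drop the others.

    The threshold is an arbitrary value chosen for illustration.
    """
    for hit in top_hits:
        if hit.score >= min_score:
            hit.include()
        else:
            hit.drop()


class FakeHit:
    """Stand-in for a Hit object, for demonstration only."""

    def __init__(self, score):
        self.score = score
        self.included = False

    def include(self):
        self.included = True

    def drop(self):
        self.included = False


hits = [FakeHit(72.5), FakeHit(12.0)]
keep_high_scoring(hits)
print([hit.included for hit in hits])  # [True, False]
```

The function could then be given to jackhmmer as select_hits=keep_high_scoring.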

Yields:

IterationResult – An iteration result instance for each query, in the same order the queries were passed in the input. If the checkpoints option is True, all iterations will be returned instead of the last one.
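Because the shape of each yielded item depends on checkpoints, downstream code may want to normalize it. The sketch below keeps only the final iteration per query in both cases; final_results is a hypothetical helper, and plain strings stand in for IterationResult objects.

```python
def final_results(results, checkpoints):
    """Yield one final item per query from a jackhmmer-style result stream."""
    for item in results:
        if checkpoints:
            # Each item is an iterable of per-iteration results; keep the last.
            item = list(item)[-1]
        yield item


# With checkpoints: one iterable of iterations per query.
print(list(final_results([["it1", "it2"], ["it1"]], checkpoints=True)))
# ['it2', 'it1']

# Without checkpoints: already one result per query.
print(list(final_results(["a", "b"], checkpoints=False)))
# ['a', 'b']
```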

Raises:

AlphabetMismatch – When any of the query sequences, the profile, or the optional builder do not share the same alphabet.

Note

Any additional keyword arguments passed to the jackhmmer function will be passed transparently to the Pipeline to be created in each worker thread.

Caution

Default values used for jackhmmer do not correspond to the default parameters used for creating a pipeline in the other cases. If no parameter value is given as a keyword argument, jackhmmer will create the pipeline with incE=0.001 and incdomE=0.001, where a default Pipeline would use incE=0.01 and incdomE=0.01.
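To run jackhmmer with the inclusion thresholds of a default Pipeline, the two values can be passed explicitly as keyword arguments. The sketch below builds such an option set; with_pipeline_defaults is a hypothetical helper, and only the threshold values quoted in this caution are taken from the documentation.

```python
# Inclusion thresholds of a default Pipeline, as quoted in the caution.
PIPELINE_DEFAULTS = {"incE": 0.01, "incdomE": 0.01}


def with_pipeline_defaults(**options):
    """Fill in Pipeline-style inclusion thresholds unless already given."""
    merged = dict(PIPELINE_DEFAULTS)
    merged.update(options)  # explicitly passed values win
    return merged


print(with_pipeline_defaults(E=1e-5))
# {'incE': 0.01, 'incdomE': 0.01, 'E': 1e-05}
```

The result could then be unpacked in the call, for instance jackhmmer(queries, sequences, **with_pipeline_defaults(E=1e-5)).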

Added in version 0.8.0.