Performance tips and tricks

HMMER is a very comprehensive tool, and PyHMMER mimics most of its internals quite closely, so the API can be a little overwhelming. There is often more than one way to perform the same operation, with various levels of configuration, so it can be hard to know how to make the most of it.

Storing HMMs

HMMs can be stored in two different formats, a text and a binary format, which are equivalent. The hmmconvert binary from HMMER can be used to convert from one to the other; in PyHMMER, you can load both formats with an HMMFile, and write either of the two with the HMM.write method. Text HMM files usually receive the .hmm extension, while binary HMM files use the .h3m extension.
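For instance, here is a minimal sketch of how a text HMM file could be converted to the binary format with PyHMMER, using the same t2pks file as the rest of this section (the output path is arbitrary):

import pyhmmer

# load every HMM from the text file and write it back in binary format
with pyhmmer.plan7.HMMFile("data/hmms/txt/t2pks.hmm") as hmm_file:
    with open("t2pks.h3m", "wb") as dst:
        for hmm in hmm_file:
            hmm.write(dst, binary=True)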

The binary format is platform-specific: it depends on the endianness of the local platform, so for distributing HMM files, the text format should be preferred. The main advantage of the binary format, however, is that it is much smaller (often half the size of the text format) and much faster to load. We can quickly see the difference by loading the same HMMs from both formats: reading the binary file is at least 10x faster.

[1]:
import pyhmmer
%timeit list(pyhmmer.plan7.HMMFile("data/hmms/txt/t2pks.hmm"))
%timeit list(pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m"))
99.6 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.85 ms ± 2.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Using a binary HMM file is thus recommended when the HMMs are going to be read often, for instance when developing a pipeline that has to load the HMMs every time it is invoked.

Search and scan

The core of HMMER is a comparison pipeline that tries to match a single profile HMM to a single sequence. To turn this into a many-to-many comparison pipeline, the single comparison is run inside two loops (for every query, for every target, compare the query to the target). HMMER refers to this as a “scan” when the inner loop iterates over profile HMMs, and as a “search” when the inner loop iterates over sequences. They are equivalent, and the raw scores computed by one or the other will be the same, with the exception of the E-value (more on that later):

[2]:
import time
import pyhmmer

with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
        t1 = time.time()
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
        print(f"- hmmsearch found a total of {total} hits in {time.time() - t1:.3} seconds")

with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
        t1 = time.time()
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmscan(seqs, hmms))
        print(f"- hmmscan found a total of {total} hits in {time.time() - t1:.3} seconds")
- hmmsearch found a total of 97 hits in 0.869 seconds
- hmmscan found a total of 97 hits in 0.8 seconds

Scanning, however, is always slower than searching whenever there is more than one HMM to process. Reconfiguring the internal comparison pipeline for a new HMM is a costly operation, while reconfiguring a pipeline for a new sequence is almost free. During a “search”, the reconfiguration is done inside the outer loop, but during a “scan” the reconfiguration is done inside the inner loop. To reduce the overhead, it’s recommended to use “search” whenever possible, even if that means untangling the results later.

Switching from a “search” to a “scan”, however, changes what HMMER considers the database to be, which affects the Z parameter of the pipeline, and ultimately the E-value computed for each alignment. There are two ways to take care of this:

  • Manually set the Z parameter to the size of your database: for instance, if you annotate with Pfam v35.0, which contains 19,632 HMMs, you can manually set the Z parameter to 19632 and use hmmsearch or hmmscan interchangeably to produce the same results (see the sketch after this list).

  • Use bitscores and p-values instead of E-values to ensure reproducibility of the results even if the size of your targets changes.
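Here is a minimal sketch of the first option, using the same t2pks HMMs and target sequences as above; the Z keyword given to hmmsearch is assumed to be forwarded, like the other pipeline options, to the underlying Pipeline:

import pyhmmer

# fix Z to the number of HMMs in the query database (40 for t2pks), so that
# the E-values no longer depend on whether a "search" or a "scan" was run
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
        hits = list(pyhmmer.hmmer.hmmsearch(hmms, seqs, Z=40))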

Pre-fetching the target database

In the original HMMER implementation, the target database is always read iteratively from disk for every query. This means the computation can become I/O bound if the target database is not read fast enough, and it adds some overhead since the targets must be read again for every new query. In PyHMMER, you can choose how your targets are handled: hmmsearch, hmmscan and phmmer accept targets either from a file (a SequenceFile in digital mode for hmmsearch and phmmer, an HMMPressedFile for hmmscan) or from a block with all targets already loaded in memory (a DigitalSequenceBlock for hmmsearch and phmmer, an OptimizedProfileBlock for hmmscan).

[3]:
# loading targets iteratively - slow, but no extra memory needed
t1 = time.time()
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
        print(f"- hmmsearch without prefetching took {time.time() - t1:.3} seconds")

# pre-fetching targets - fast, but needs the whole target database in memory
t1 = time.time()
with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
    seqs = seq_file.read_block()
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
    print(f"- hmmsearch with prefetching took {time.time() - t1:.3} seconds")
- hmmsearch without prefetching took 0.864 seconds
- hmmsearch with prefetching took 0.711 seconds

The difference is not huge for t2pks, which only contains 40 individual HMMs, but it scales with the total number of queries, and can make a lot of difference when searching target sequences with a large profile library such as Pfam. Therefore, you should try to load your whole sequence database into memory when possible to get actual speed improvements. When loaded into memory, a DigitalSequenceBlock is about the size of a FASTA-formatted sequence file, with some additional overhead when the file contains many smaller sequences (as is typical for a protein file):

[4]:
import sys
import os

size = os.stat("data/seqs/PKSI.faa").st_size
print(f"Size of FASTA file: {size / 1024:.1f} KiB")

with pyhmmer.easel.SequenceFile("data/seqs/PKSI.faa", digital=True) as seq_file:
    seqs = seq_file.read_block()
    print(f"Size of block storage: {sys.getsizeof(seqs)/1024:.1f} KiB")
    print(f"Size of sequence storage: {sum(sys.getsizeof(seq) for seq in seqs)/1024:.1f} KiB")
Size of FASTA file: 17.2 KiB
Size of block storage: 0.1 KiB
Size of sequence storage: 20.4 KiB

To make sure the target database can be loaded into memory, you can use psutil (which is a PyHMMER dependency anyway) to detect the amount of available or total memory, and only try to load the targets if they are not too big. For instance, with the following code, you would pre-fetch targets only if they take less than 10% of the total memory of the local machine: for a consumer laptop with 8 GiB of RAM, this means never loading a file larger than 800 MiB, which is already much larger than the size of a complete bacterial genome.

[5]:
import os
import sys
import psutil

available_memory = psutil.virtual_memory().available
target_size = os.stat("data/seqs/938293.PRJEB85.HG003687.faa").st_size
print(f"Available memory: {available_memory/1024:.1f} KiB")
print(f"Database on-disk size: {target_size/1024:.1f} KiB")

with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmm_file:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
        if target_size < available_memory * 0.1:
            print("Pre-fetching targets into memory")
            targets = seq_file.read_block()
            print(f"Database in-memory size: {(sys.getsizeof(targets) + sum(sys.getsizeof(target) for target in targets))/1024:.1f} KiB")
        else:
            targets = seq_file
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmm_file, targets))
        print(f"Found a total of {total} hits in target sequences")
Available memory: 6615580.0 KiB
Database on-disk size: 958.5 KiB
Pre-fetching targets into memory
Database in-memory size: 1726.1 KiB
Found a total of 97 hits in target sequences

Note

While as a rule of thumb a “search” should be preferred to reduce the profile reconfiguration overhead, there is one exception: when the target sequences can’t fit in memory while the HMM queries could. For instance, to annotate 1,000,000 proteins with Pfam, it is much more efficient to run a “scan” with the Pfam HMMs pre-loaded in memory (~2 hours with 32 threads) than to run a “search” with the target sequences being read from the filesystem (~9.5 hours with 32 threads). Having the target sequences or HMMs already in memory will always result in a massive speed gain, provided the machine has enough RAM.
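As a rough sketch of that setup, the profiles could be pre-loaded into an OptimizedProfileBlock before scanning; the "data/hmms/db/t2pks.hmm" path below is hypothetical and assumes the HMM file was pressed with hmmpress beforehand:

import pyhmmer

# pre-load all optimized profiles from a pressed database into memory
# (hypothetical path, prepared with `hmmpress` beforehand)
alphabet = pyhmmer.easel.Alphabet.amino()
with pyhmmer.plan7.HMMPressedFile("data/hmms/db/t2pks.hmm") as pressed:
    profiles = pyhmmer.plan7.OptimizedProfileBlock(alphabet, pressed)

# stream the query sequences from disk while scanning against the in-memory profiles
with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
    total = sum(len(hits) for hits in pyhmmer.hmmer.hmmscan(seqs, profiles))
    print(f"Found a total of {total} hits")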