{ "cells": [ { "cell_type": "markdown", "id": "a916d57f-eb73-454a-8135-642b0030444e", "metadata": {}, "source": [ "# Performance tips and tricks" ] }, { "cell_type": "markdown", "id": "0cdb5bde-9415-42a4-ad31-12410e4c29da", "metadata": {}, "source": [ "HMMER is a very comprehensive tool, and PyHMMER mimicks most of the internals quite closely, so the API can be a little overwhelming. There is often more than one way to perform the same operation, with various level of configuration, so it can be hard to know how to make the most of it." ] }, { "cell_type": "markdown", "id": "4d74dbb5-9485-433f-b174-c7116e412b60", "metadata": {}, "source": [ "## Storing HMMs\n", "\n", "HMMs can be stored in two different formats, *text* and a *binary*, which are equivalent. The `hmmconvert` binary from HMMER can be used to convert from one to the other; in PyHMMER, you can load both formats with an `HMMFile`, and write either of the two with the `HMM.write` method. The text HMM files often receive the `.hmm` extension while the binary HMM files have the `.h3m` extension.\n", "\n", "The binary format is **platform-specific**: it depends on the endianness of the local platform, so for distributing HMM files, the text format should be prefered. The main advantage of the binary format, however, is that it is much smaller (often half the size of the text format) and much faster to load. We can quickly see the difference by trying to load the same HMMs: loading a binary file is at least 10x faster. " ] }, { "cell_type": "code", "execution_count": null, "id": "6ae3572f-2d46-4957-a9cf-26eaed141fca", "metadata": {}, "outputs": [], "source": [ "import pyhmmer\n", "%timeit list(pyhmmer.plan7.HMMFile(\"data/hmms/txt/RREFam.hmm\"))\n", "%timeit list(pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\"))" ] }, { "cell_type": "markdown", "id": "e6032858-3f1b-4e62-a019-ffde2119a904", "metadata": {}, "source": [ "Using a binary HMM file is thus recommended when the HMMs are going to be read a lot, for instance when developping a pipeline that will have to load the HMMs every time it's invoked." ] }, { "cell_type": "markdown", "id": "ebf8877e-6db1-470d-8ea5-9e1e392f6caf", "metadata": {}, "source": [ "## Search and scan\n", "\n", "The core of HMMER is a comparison pipeline that tries to match a single profile HMM to a single sequence. To turn this into a many-to-many comparison pipeline, the single comparison is run inside of two loops (for every query, for every target, compare the query to the target). HMMER refers to it as a \"scan\" when the inner loop iterates over profiles HMMs, and a \"search\" when the inner loop iterates over sequences. They are equivalent, and the raw scores computed by one or the others will be the same, with the exception of the E-value (more on that later):" ] }, { "cell_type": "code", "execution_count": null, "id": "75bb0c36-2c87-4df3-abbe-ae76325185fd", "metadata": {}, "outputs": [], "source": [ "import time\n", "import pyhmmer\n", "\n", "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmms:\n", " with pyhmmer.easel.SequenceFile(\"data/seqs/938293.PRJEB85.HG003687.faa\", digital=True) as seqs:\n", " t1 = time.time()\n", " total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))\n", " print(f\"- hmmsearch found a total of {total} hits in {time.time() - t1:.3} seconds\")\n", " \n", "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmms:\n", " with pyhmmer.easel.SequenceFile(\"data/seqs/938293.PRJEB85.HG003687.faa\", digital=True) as seqs:\n", " t1 = time.time()\n", " total = sum(len(hits) for hits in pyhmmer.hmmer.hmmscan(seqs, hmms))\n", " print(f\"- hmmscan found a total of {total} hits in {time.time() - t1:.3} seconds\")" ] }, { "cell_type": "markdown", "id": "a0631dd0-c893-4df9-9c2a-67e82b3ed446", "metadata": {}, "source": [ "Scanning, however, is always slower than searching whenever there is more than one HMM to process. Reconfiguring the internal comparison pipeline for a new HMM is a costly operation, while reconfiguring a pipeline for a new sequence is almost free. During a \"search\", the reconfiguration is done inside the outer loop, but during a \"scan\" the reconfiguration is done inside the inner loop. To reduce the overhead, it's recommended to use \"search\" whenever possible, even if that means untangling the results later.\n", "\n", "Switching from a \"search\" to a \"scan\" means however changing what HMMER considers the database to be, which affects the `Z` parameter of the pipeline, and ultimately the E-value being computed for each alignment. There are two ways to take care of this:\n", "\n", "- Manually set the `Z` parameter to the size of what you database is: for instance, if you annotate with Pfam v35.0 which contains 19,632 HMMs, you can manually set the `Z` parameter to `19632` and use `hmmsearch` or `hmmscan` indifferently to produce the same results.\n", "- Use bitscores and p-values instead of E-values to ensure reproducibility of the results even if the size of your targets changes." ] }, { "cell_type": "markdown", "id": "9562b324-1bb1-482e-8381-462cd32dc078", "metadata": {}, "source": [ "## Pre-fetching the target database " ] }, { "cell_type": "markdown", "id": "a47103f3-4f56-4df4-ab36-d5e7a409b54e", "metadata": {}, "source": [ "In the original HMMER implementation, the target database is always read iteratively from disk for every query. This means the computation can become I/O bound if the target database is not read fast enough, and adds some overhead since the targets must be read again for every new query. In PyHMMER, you have the choice of how you want your targets to be handled: `hmmsearch`, `hmmscan` and `phmmer` accept being given targets both inside a file (a `SequenceFile` in digital mode for `hmmsearch` and `phmmer`, a `HMMPressedFile` for `hmmscan`) or inside a block with all targets already loaded in memory (a `DigitalSequenceBlock` for `hmmsearch` and `phmmer`, an `OptimizedProfileBlock` for `hmmscan`)." ] }, { "cell_type": "code", "execution_count": null, "id": "d99e5d4c-e336-4899-9faa-b674f79a0789", "metadata": {}, "outputs": [], "source": [ "# loading targets iteratively - slow, but no extra memory needed\n", "t1 = time.time()\n", "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmms:\n", " with pyhmmer.easel.SequenceFile(\"data/seqs/938293.PRJEB85.HG003687.faa\", digital=True) as seqs:\n", " total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))\n", " print(f\"- hmmsearch without prefetching took {time.time() - t1:.3} seconds\")\n", " \n", "# pre-fetching targets - fast, but needs the whole target database in memory\n", "t1 = time.time()\n", "with pyhmmer.easel.SequenceFile(\"data/seqs/938293.PRJEB85.HG003687.faa\", digital=True) as seq_file:\n", " seqs = seq_file.read_block()\n", "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmms:\n", " total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))\n", " print(f\"- hmmsearch with prefetching took {time.time() - t1:.3} seconds\")" ] }, { "cell_type": "markdown", "id": "8a000d01-26f6-4bd1-9eed-fdcc2c88f20c", "metadata": {}, "source": [ "The difference is not huge for `t2pks`, which only contains 40 individual HMMs, but it scales with the total number of queries, and can make a lot of difference when searching target sequences with a large profile library such as Pfam. Therefore, you should try to load your whole sequence database into memory when possible to get actual speed improvements. When loaded into memory, an `OptimizedSequenceBlock` is about the size of a FASTA formatted sequence file, with some additional overhead when the file has plenty of smaller sequences (like a protein file, typically):" ] }, { "cell_type": "code", "execution_count": null, "id": "665e19df-052d-4448-8516-ccabd3b95513", "metadata": {}, "outputs": [], "source": [ "import sys\n", "import os\n", "\n", "size = os.stat(\"data/seqs/PKSI.faa\").st_size\n", "print(f\"Size of FASTA file: {size / 1024:.1f} KiB\")\n", "\n", "with pyhmmer.easel.SequenceFile(\"data/seqs/PKSI.faa\", digital=True) as seq_file:\n", " seqs = seq_file.read_block()\n", " print(f\"Size of block storage: {sys.getsizeof(seqs)/1024:.1f} KiB\")\n", " print(f\"Size of sequence storage: {sum(sys.getsizeof(seq) for seq in seqs)/1024:.1f} KiB\")" ] }, { "cell_type": "markdown", "id": "84eb4961-286b-44e7-be72-4ca109b0a110", "metadata": {}, "source": [ "To make sure the target database can be loaded into memory, you can use `psutil` (which is a PyHMMER dependency anyway) to detect the amount of available or total memory, and only try to load the targets if they are not to big. For instance, with the following code, you would pre-fetch targets if they take less that a 10% of the total memory of the local machine: for a consumer laptop with 8GiB of RAM, it means never loading a file larger than 800MiB, which is already much larger than the size of a complete bacterial genome." ] }, { "cell_type": "code", "execution_count": null, "id": "270385fa-3649-4417-a778-5aab8725e141", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import psutil\n", "\n", "available_memory = psutil.virtual_memory().available\n", "target_size = os.stat(\"data/seqs/938293.PRJEB85.HG003687.faa\").st_size\n", "print(f\"Available memory: {available_memory/1024:.1f} KiB\")\n", "print(f\"Database on-disk size: {target_size/1024:.1f} KiB\")\n", "\n", "with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmm_file:\n", " with pyhmmer.easel.SequenceFile(\"data/seqs/938293.PRJEB85.HG003687.faa\", digital=True) as seq_file:\n", " if target_size < available_memory * 0.1:\n", " print(\"Pre-fetching targets into memory\")\n", " targets = seq_file.read_block()\n", " print(f\"Database in-memory size: {(sys.getsizeof(targets) + sum(sys.getsizeof(target) for target in targets))/1024:.1f} KiB\")\n", " else:\n", " targets = seq_file\n", " total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmm_file, targets))\n", " print(f\"Found a total of {total} hits in target sequences\")\n", " " ] }, { "cell_type": "markdown", "id": "257da004-09b9-4ba2-adad-cf67e4d8e767", "metadata": {}, "source": [ "