Create a Python package with embedded HMMs

Overview

Let’s suppose we are developing a Python package for detecting and analyzing thioredoxins, where PyHMMER is used to identify thioredoxin domains. We want to distribute the package and all the required HMMs so that the end users can simply use the functions without having to download additional data.

Hint

If you’re new to developing Python packages, you may want to have a look at the Python Packaging User Guide.

For this example, we will be using pre-made HMMs downloaded from InterPro, but this would work similarly if we were to use custom-made HMMs.

Caution

Make sure that you have permission to redistribute the HMMs along with your code! For instance, Pfam is licensed under Creative Commons Zero, which places the data in the public domain, but SMART is licensed by EMBLEM and does not allow redistribution.

Folder structure

Let’s start with a sample project using setuptools as the build backend. For an example package named redox_detector, we would have the following folder structure:

.
├── LICENSE
├── README.md
├── setup.cfg
├── setup.py
├── redox_detector/
│   ├── __init__.py
│   └── search.py
└── tests/

The easiest way to store the HMMs is to have them right next to the Python files that will be using them. For instance, if we want to use the TIGR01068 (thioredoxin) HMM, we can simply download it and put it in the main module folder:

.
├── LICENSE
├── README.md
├── setup.cfg
├── setup.py
├── redox_detector/
│   ├── __init__.py
│   ├── search.py
│   └── TIGR01068.hmm
└── tests/

Loading data from Python

Since Python 3.7, the standard library contains the importlib.resources module, which provides an interface for loading arbitrary package data.
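As a minimal sketch of this interface in isolation: on Python 3.9 and later, `importlib.resources.files` returns a path-like object rooted at a package folder, from which a bundled file can be read. The helper name `read_package_data` is hypothetical and only used for illustration; it is not part of PyHMMER or the standard library.

```python
import importlib.resources


def read_package_data(package: str, filename: str) -> bytes:
    """Return the raw bytes of a data file shipped inside `package`.

    `read_package_data` is a hypothetical helper used for illustration;
    it assumes Python 3.9+ (for `importlib.resources.files`).
    """
    # `files` returns a traversable object rooted at the package folder
    resource = importlib.resources.files(package).joinpath(filename)
    return resource.read_bytes()
```

With the layout above, `read_package_data("redox_detector", "TIGR01068.hmm")` would be expected to return the content of the bundled HMM file once the package is importable.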

For instance, we can write a function that takes a list of sequences and returns only the sequences that contain a thioredoxin domain, using the internal HMM for finding hits:

# search.py

import importlib.resources
from typing import List

from pyhmmer.plan7 import HMMFile
from pyhmmer.easel import Alphabet, Bitfield, TextSequence
from pyhmmer.hmmer import hmmsearch

def filter_thioredoxins(sequences: List[str]) -> List[str]:
    # turn the input sequences into DigitalSequence objects
    # (we use the index of the sequence as their name; protein HMMs
    # such as TIGR01068 use the amino acid alphabet)
    alphabet = Alphabet.amino()
    digital_sequences = [
        TextSequence(sequence=seq, name=str(i).encode()).digitize(alphabet)
        for i, seq in enumerate(sequences)
    ]

    # use a bitmap to record which input sequences have had a hit
    is_thioredoxin = Bitfield.zeros(len(sequences))

    # search sequences using the embedded HMMs with `importlib.resources`
    # (using __package__ as the package name tells `open_binary` to
    # look in the package folder, next to the Python source file;
    # older Python versions reject a plain module name such as __name__)
    with importlib.resources.open_binary(__package__, "TIGR01068.hmm") as src:
        with HMMFile(src) as hmm_file:
            # run the search pipeline and get hits with E-value <= 1e-5
            for hits in hmmsearch(hmm_file, digital_sequences, E=1e-5):
                for hit in hits.included:
                    is_thioredoxin[int(hit.name)] = True

    # return only the sequences that had at least one hit
    return [ seq for i, seq in enumerate(sequences) if is_thioredoxin[i] ]

Hint

In this example, the HMM file contains only a single HMM; however, the code above would work even if the file contained more than one HMM.

Distributing data

Now that the data is ready and the Python code knows how to load it, all that is left is to make sure that setuptools actually includes the data files in the distribution files.

Using the appropriate section in the setup.cfg file, we can instruct setuptools to add any file with the .hmm extension to the distribution files:

[options.package_data]
redox_detector = *.hmm

Now running python setup.py sdist bdist_wheel will produce source and wheel distributions that will contain the HMM file. When these archives are installed, the HMMs will be installed next to the code.
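To check that the HMM file really ended up in the built wheel, the archive can be inspected directly, since wheels are ordinary ZIP files. The following is a minimal sketch using only the standard library; the helper name `list_hmm_files` is hypothetical.

```python
import zipfile
from typing import List


def list_hmm_files(wheel_path: str) -> List[str]:
    """Return the names of all `.hmm` members found in a wheel archive.

    `list_hmm_files` is a hypothetical helper used for illustration.
    """
    # wheels are ordinary ZIP archives, so `zipfile` can list their members
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(name for name in wheel.namelist() if name.endswith(".hmm"))
```

Calling this function on the wheel produced in the `dist/` folder should list `redox_detector/TIGR01068.hmm` if the package data was picked up correctly; an empty list suggests that the `[options.package_data]` section was not applied.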

Caution

PyPI has some size limits for uploaded files: by default, a single distribution cannot be larger than 100 MB. You will be able to use this method for a relatively small number of HMMs (up to a few thousand), but you won’t be able to upload a large HMM database such as Pfam in its entirety.
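A quick way to catch this before uploading is to compare the size of each built distribution against the limit. The sketch below assumes that the limit is expressed in bytes and that 100 MB means 100 * 10**6 bytes; both the helper name `oversized_distributions` and that interpretation are assumptions, so the `limit` parameter should be adjusted to the actual limit of the target index.

```python
from pathlib import Path
from typing import List

# Assumption: the default 100 MB limit is interpreted as 100 * 10**6 bytes
DEFAULT_LIMIT = 100 * 10**6


def oversized_distributions(dist_dir: str, limit: int = DEFAULT_LIMIT) -> List[str]:
    """Return the names of files in `dist_dir` that are larger than `limit` bytes.

    `oversized_distributions` is a hypothetical helper used for illustration.
    """
    return sorted(
        path.name
        for path in Path(dist_dir).iterdir()
        if path.is_file() and path.stat().st_size > limit
    )
```

Running `oversized_distributions("dist")` after building should return an empty list when every archive is small enough to upload.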