Create a Python package with embedded HMMs
Overview
Let’s suppose we are developing a Python package for detecting and analyzing thioredoxins, where PyHMMER is used to identify thioredoxin domains. We want to distribute the package and all the required HMMs so that the end users can simply use the functions without having to download additional data.
Hint
If you’re new to developing Python packages, you may want to have a look at the Python Packaging User Guide.
For this example, we will be using pre-made HMMs downloaded from InterPro, but this would work similarly if we were to use custom-made HMMs.
Caution
Make sure that you have the permission to redistribute the HMMs along with your code! For instance, Pfam is licensed under Creative Commons Zero which places the data in the public domain, but SMART is licensed by EMBLEM and does not allow redistribution.
Folder structure
Let’s start with a sample project using setuptools as the build backend.
Given an example package we would have the following folder structure:
.
├── LICENSE
├── README.md
├── setup.cfg
├── setup.py
├── redox_detector/
│ ├── __init__.py
│ └── search.py
└── tests/
The easiest way to store the HMMs is to have them right next to the Python files that will be using them. For instance, if we want to use the TIGR01068 (thioredoxin) HMM, we can simply download it and put it in the main module folder:
.
├── LICENSE
├── README.md
├── setup.cfg
├── setup.py
├── redox_detector/
│ ├── __init__.py
│ ├── search.py
│ └── TIGR01068.hmm
└── tests/
Loading data from Python
Since Python 3.7, the standard library contains the importlib.resources module, which provides an interface for loading arbitrary package data.
For instance, we can write a function that takes a list of sequences and returns only the sequences that contain a thioredoxin domain, using the internal HMM for finding hits:
# search.py
import importlib.resources
from typing import List

from pyhmmer.plan7 import HMMFile
from pyhmmer.easel import Alphabet, Bitfield, TextSequence
from pyhmmer.hmmer import hmmsearch


def filter_thioredoxins(sequences: List[str]) -> List[str]:
    # turn the input sequences into DigitalSequence objects
    # (we use the index of the sequence as their name)
    alphabet = Alphabet.amino()
    digital_sequences = [
        TextSequence(sequence=seq, name=str(i).encode()).digitize(alphabet)
        for i, seq in enumerate(sequences)
    ]
    # use a bitmap to record which input sequences have had a hit
    is_thioredoxin = Bitfield.zeros(len(sequences))
    # search sequences using the embedded HMMs with `importlib.resources`
    # (using __package__ as the package name tells `open_binary` to
    # look in the folder of the package that contains this source file;
    # a plain module name is rejected by older Python versions)
    with importlib.resources.open_binary(__package__, "TIGR01068.hmm") as src:
        with HMMFile(src) as hmm_file:
            # run the search pipeline and get hits with E-value <= 1e-5
            for hits in hmmsearch(hmm_file, digital_sequences, E=1e-5):
                for hit in hits.included:
                    is_thioredoxin[int(hit.name)] = True
    # return only the sequences that had at least one hit
    return [seq for i, seq in enumerate(sequences) if is_thioredoxin[i]]
Hint
In this example the HMM file contains only a single HMM, but the code above would work even if the file contained more than one HMM.
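As an alternative to `open_binary`, Python 3.9 and later also provide `importlib.resources.files`, which returns a path-like object rooted at the package folder. The following is a minimal sketch of that approach, not part of the example package: the helper name `open_embedded` is hypothetical, and it assumes the data file sits directly inside an importable package.

```python
import importlib.resources
from typing import BinaryIO


def open_embedded(package: str, resource: str) -> BinaryIO:
    """Open a data file stored inside an importable package in binary mode."""
    # `files()` (Python 3.9+) returns a Traversable for the package folder;
    # `joinpath` selects the data file stored next to the Python sources
    path = importlib.resources.files(package).joinpath(resource)
    # the caller is responsible for closing the returned file object
    return path.open("rb")
```

Under these assumptions, `open_embedded("redox_detector", "TIGR01068.hmm")` would return a binary file object that can be used in a `with` statement in the same way as the one returned by `open_binary`.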
Distributing data
Now that the data is ready and the Python code knows how to load it, all that is left is to make sure the data files are actually picked up by setuptools and included in the distribution files.
Using the appropriate section in the setup.cfg file, we can instruct setuptools to add any file with the .hmm extension to the distribution files:
[options.package_data]
redox_detector = *.hmm
Now running python setup.py sdist bdist_wheel will produce source and wheel distributions that contain the HMM file. When these archives are installed, the HMMs will be installed next to the code.
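Before publishing, it can be worth checking that the HMM file really ended up in the wheel. Since a wheel is a regular ZIP archive, the standard library is enough for this. The sketch below is only an illustration: the helper name `hmm_files_in_wheel` is hypothetical, and the wheel path has to be adapted to whatever file the build produced.

```python
import zipfile
from typing import List


def hmm_files_in_wheel(wheel_path: str) -> List[str]:
    """List the members of a wheel archive whose name ends with `.hmm`."""
    # a wheel is a ZIP archive, so `zipfile` can read its table of contents
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(name for name in wheel.namelist() if name.endswith(".hmm"))
```

If the `package_data` setting was applied, the returned list should include an entry such as `redox_detector/TIGR01068.hmm`; an empty list suggests the data files were not picked up.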
Caution
PyPI has some size limits for uploaded files: by default, a single distribution cannot be larger than 100MB. You will be able to use this method for a relatively small number of HMMs (up to a few thousand), but you won’t be able to upload a large HMM database such as Pfam in its entirety.
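A quick way to see whether the built archives stay under that limit is to compare their size on disk against it. The following is a rough sketch rather than an official check: the helper name `oversized_distributions` is hypothetical, and the limit is assumed to be 100 × 1024 × 1024 bytes based on the figure quoted above.

```python
from pathlib import Path
from typing import Dict

# assumed default limit, derived from the 100MB figure quoted above
SIZE_LIMIT = 100 * 1024 * 1024


def oversized_distributions(dist_dir: str, limit: int = SIZE_LIMIT) -> Dict[str, int]:
    """Map the name of each file in `dist_dir` larger than `limit` to its size in bytes."""
    sizes = {}
    for path in sorted(Path(dist_dir).iterdir()):
        size = path.stat().st_size
        # only regular files (sdist and wheel archives) are considered
        if path.is_file() and size > limit:
            sizes[path.name] = size
    return sizes
```

Calling `oversized_distributions("dist")` after a build returns an empty dictionary when every archive is below the assumed limit.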