Run an iterative search to build a HMM for rhodopsins

This examples shows an example workflow for building an HMM from a seed sequence through an iterative search in a sequence database, similar to the JackHMMER methodology. In the second half it will also introduce extra capabilities of the Python wrapper allowing the dynamic modification of the search parameters.

The goal of this example is to build a HMM specific to halorhodopsins, a class of light-activated ion pumps found in Archaea. The difficulty of that task comes from their similarity to bacteriorhodopsins, which have a comparable conserved structure, but different activity.

Classes of rhodopsins

Figure from Zhang et al. (2011).

In a second time we will then build a broader HMM for any kind of rhodopsin, while introducing how to implement a custom hit selection function. This will allow for the automatic exclusion of false positives coming from other organisms, using an additional filter based on taxonomy.

[1]:
import pyhmmer
pyhmmer.__version__
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 1
----> 1 import pyhmmer
      2 pyhmmer.__version__

ModuleNotFoundError: No module named 'pyhmmer'

Downloading sequence data

For this example workflow we will be using SwissProt, directly downloading the sequences in FASTA format from the EMBL-EBI data server. The remote file is compressed, however we can decompress the sequences on the fly using the gzip module of the Python standard library.

[ ]:
import gzip
import urllib.request

alphabet = pyhmmer.easel.Alphabet.amino()

url = "http://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz"
with urllib.request.urlopen(url) as response:
    reader = gzip.open(response, "rb")
    with pyhmmer.easel.SequenceFile(reader, digital=True, alphabet=alphabet) as seq_file:
        swissprot = seq_file.read_block()

The SwissProt sequences all contain the source organism and NCBI Taxonomy accession in their FASTA header, and we will need these pieces of information later. For now, we can build an index that maps a sequence name to its source organism, and another index for taxonomy:

[ ]:
import re

taxonomy_by_name = {}
organism_by_name = {}
description_by_name = {}

for seq in swissprot:
    match = re.search(b"(.*) OS=(.*) OX=(\d+)", seq.description)
    taxonomy_by_name[seq.name] = int(match.group(3))
    organism_by_name[seq.name] = match.group(2)
    description_by_name[seq.name] = match.group(1)

Running an iterative search with custom selection

The previous iterative search helped us create a HMM only covering halorhodopsins. However, let’s try to build a very broad HMM for rhodopsin search in Halobacteria. For this round, we do not care about functional specificity, but we care about taxonomy. In particular, looking at the hits from the last iteration, there are some fungal protein among the archeal ones (the opsin from Neurospora crassa, for instance). To build a HMM specific to archeal proteins we cannot rely on a simple score cutoff, so let’s instead build a dedicated function to select hits between each iteration based on taxonomy.

For handling the taxonomy we will be using taxopy, a Python library for reading NCBI Taxonomy files. Let’s start by loading the files so that we have a taxonomy database at hand:

[ ]:
import taxopy
taxdb = taxopy.TaxDb(keep_files=False)

Now let’s write a selection function. We still rely on the score inclusion criterion, but we also will want to exclude any hit from an organism that is not in the Halobacteria class.

[ ]:
def select_halobacteria_hits(top_hits):
    for hit in top_hits:
        taxid = taxonomy_by_name[hit.name]
        taxon = taxopy.Taxon(taxid, taxdb)
        if hit.included and taxon.rank_name_dictionary.get('class') != "Halobacteria":
            hit.dropped = True

Now run the iterative search with a lower inclusion threshold, and the selection function we just made:

[ ]:
pli = pyhmmer.plan7.Pipeline(alphabet, incT=100, incdomT=100)
search = pli.iterate_seq(hop, swissprot, select_hits=select_halobacteria_hits)
[ ]:
iterations = []
while not search.converged:
    iteration = next(search)
    iterations.append(iteration)
    worst_score = min(hit.score for hit in iteration.hits if hit.included)
    print(f"Iteration={iteration.iteration} Hits={len(iteration.hits):3} Included={iteration.hits.hits_included:3} Converged={search.converged}")

The search converges in 4 iterations. Let’s have a look at the score distribution:

[ ]:
import matplotlib.pyplot as plt

for iteration in iterations:
    plt.plot([hit.score for hit in iteration.hits], "o-", label=f"Iteration {iteration.iteration}")

plt.legend()
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()

We can make sure that we are not excluding any rhodopsin from Halobacteria by having a look at the excluded hits from the last iteration:

[ ]:
for hit in iterations[-1].hits:
    if not hit.included:
        print(f"{hit.name.decode():<24}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():64}\t{organism_by_name[hit.name][:60].decode()}")

Notice that although it is above inclusion threshold, the opsin from Neurospora crassa stays excluded, as expected from our selection function. Regarding the included targets in the final hits, they should only contain rhodopsins from Halobacteria:

[ ]:
for hit in iterations[-1].hits:
    if hit.included:
        print(f"{hit.name.decode():<24}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():34}\t{organism_by_name[hit.name].decode()}")

Recording bitscore cutoffs

A useful feature of HMMER is the possibility to use HMM-specific bitscore cutoffs when searching sequences with a HMM, instead of relying on E-value cutoffs. Three cutoffs categories are available:

  • Noise cutoffs (NC): The least stringent of the 3 cutoffs, normally set as the highest score seen for negative sequences when gathering members of full alignments.

  • Gathering cutoffs (GA): More stringent than noise cutoffs, normally computed as the cutoffs used when gathering members of full alignments.

  • Trusted cutoffs (TC): The most stringent of the 3 cutoffs, normally set according to the lowest scores seen for true homologous sequences that were above the GA gathering thresholds, when gathering members of full alignments.

[ ]:
hmm = iterations[-1].hmm

Let’s first build noise cutoffs based on the negative hits: we can use the highest scoring excluded sequence.

[ ]:
from math import ceil

noise_score = max(hit.score for hit in iterations[-1].hits if not hit.included)
noise_domscore = max(hit.best_domain.score for hit in iterations[-1].hits if not hit.included)
hmm.cutoffs.noise = ceil(noise_score), ceil(noise_domscore)

For gathering, we can use the lowest scoring included sequence:

[ ]:
from math import floor

gathering_score = min(hit.score for hit in iterations[-1].hits if hit.included)
gathering_domscore = min(hit.best_domain.score for hit in iterations[-1].hits if hit.included)
hmm.cutoffs.gathering = floor(gathering_score), floor(gathering_domscore)

For trusted, it’s a little bit less obvious. If you look at the bitscore distribution from the last iteration, we can see 3 distinct group of hits, one in the 390-350 range, one in the 315-260 range, and one in the 235-200 range. We may want to set the trusted cutoffs so that the only the hits in the top range can be trusted.

[ ]:
from math import floor

i = max(i for i, hit in enumerate(iterations[-1].hits) if hit.score > 350)
hmm.cutoffs.trusted = floor(iterations[-1].hits[i].score), floor(iterations[-1].hits[i].best_domain.score)

Otherwise, for a more unsupervised approach, we could try to extract a cutoffs by selecting the score of the hits where the derivative is at its maximum:

[ ]:
diff = [
    iterations[-1].hits[i-1].score - iterations[-1].hits[i].score
    for i in range(1, len(iterations[-1].hits))
    if iterations[-1].hits[i-1].included
]
[ ]:
import matplotlib.pyplot as plt

plt.plot(diff)
plt.ylabel("dscore")
plt.xlabel("Hit rank")
plt.show()
[ ]:
from math import floor

argmax = max(range(len(diff)), key=diff.__getitem__)
hit = iterations[-1].hits[argmax]
hmm.cutoffs.trusted = floor(hit.score), floor(hit.best_domain.score)

Saving the HMM

Now that we are done creating the HMM, we can save it for later use:

[ ]:
hmm.name = b"Rhodopsin"
hmm.description = None
hmm.accession = None
[ ]:
with open("rhodopsin.hmm", "wb") as dst_file:
    hmm.write(dst_file)