Run an iterative search to build a HMM for rhodopsins

This examples shows an example workflow for building an HMM from a seed sequence through an iterative search in a sequence database, similar to the JackHMMER methodology. In the second half it will also introduce extra capabilities of the Python wrapper allowing the dynamic modification of the search parameters.

The goal of this example is to build a HMM specific to halorhodopsins, a class of light-activated ion pumps found in Archaea. The difficulty of that task comes from their similarity to bacteriorhodopsins, which have a comparable conserved structure, but different activity.

Classes of rhodopsins

Figure from Zhang et al. (2011).

In a second time we will then build a broader HMM for any kind of rhodopsin, while introducing how to implement a custom hit selection function. This will allow for the automatic exclusion of false positives coming from other organisms, using an additional filter based on taxonomy.

[1]:
import pyhmmer
pyhmmer.__version__
[1]:
'0.6.1'

Downloading sequence data

For this example workflow we will be using SwissProt, directly downloading the sequences in FASTA format from the EMBL-EBI data server.

[2]:
import gzip
import urllib.request

alphabet = pyhmmer.easel.Alphabet.amino()

url = "http://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz"
with urllib.request.urlopen(url) as response:
    reader = gzip.open(response, "rb")
    swissprot = list(pyhmmer.easel.SequenceFile(reader, digital=True, alphabet=alphabet))

The SwissProt sequences all contain the source organism and NCBI Taxonomy accession in their FASTA header, and we will need these pieces of information later. For now, we can build an index that maps a sequence name to its source organism, and another index for taxonomy:

[3]:
import re

taxonomy_by_name = {}
organism_by_name = {}
description_by_name = {}

for seq in swissprot:
    match = re.search(b"(.*) OS=(.*) OX=(\d+)", seq.description)
    taxonomy_by_name[seq.name] = int(match.group(3))
    organism_by_name[seq.name] = match.group(2)
    description_by_name[seq.name] = match.group(1)

Running an iterative search with custom selection

The previous iterative search helped us create a HMM only covering halorhodopsins. However, let’s try to build a very broad HMM for rhodopsin search in Halobacteria. For this round, we do not care about functional specificity, but we care about taxonomy. In particular, looking at the hits from the last iteration, there are some fungal protein among the archeal ones (the opsin from Neurospora crassa, for instance). To build a HMM specific to archeal proteins we cannot rely on a simple score cutoff, so let’s instead build a dedicated function to select hits between each iteration based on taxonomy.

For handling the taxonomy we will be using taxopy, a Python library for reading NCBI Taxonomy files. Let’s start by loading the files so that we have a taxonomy database at hand:

[13]:
import taxopy
taxdb = taxopy.TaxDb(keep_files=False)

Now let’s write a selection function. We still rely on the score inclusion criterion, but we also will want to exclude any hit from an organism that is not in the Halobacteria class.

[14]:
def select_halobacteria_hits(top_hits):
    for hit in top_hits:
        taxid = taxonomy_by_name[hit.name]
        taxon = taxopy.Taxon(taxid, taxdb)
        if hit.is_included() and taxon.rank_name_dictionary.get('class') != "Halobacteria":
            hit.manually_drop()

Now run the iterative search with a lower inclusion threshold, and the selection function we just made:

[15]:
pli = pyhmmer.plan7.Pipeline(alphabet, incT=100, incdomT=100)
search = pli.iterate_seq(hop, swissprot, select_hits=select_halobacteria_hits)
[16]:
iterations = []
while not search.converged:
    iteration = next(search)
    iterations.append(iteration)
    worst_score = min(hit.score for hit in iteration.hits if hit.is_included())
    print(f"Iteration={iteration.iteration} Hits={len(iteration.hits):3} Included={iteration.hits.hits_included:3} Converged={search.converged}")
Iteration=1 Hits= 46 Included= 19 Converged=False
Iteration=2 Hits= 48 Included= 32 Converged=False
Iteration=3 Hits=101 Included= 40 Converged=False
Iteration=4 Hits= 62 Included= 40 Converged=True

The search converges in 4 iterations. Let’s have a look at the score distribution:

[17]:
import matplotlib.pyplot as plt

for iteration in iterations:
    plt.plot([hit.score for hit in iteration.hits], "o-", label=f"Iteration {iteration.iteration}")

plt.legend()
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()
../_images/examples_iterative_search_36_0.svg

We can make sure that we are not excluding any rhodopsin from Halobacteria by having a look at the excluded hits from the last iteration:

[18]:
for hit in iterations[-1].hits:
    if not hit.is_included():
        print(f"{hit.name.decode():<24}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():64}\t{organism_by_name[hit.name][:60].decode()}")
sp|Q9UW81|NOP1_NEUCR            163.3   Opsin-1                                                                 Neurospora crassa (strain ATCC 24698 / 74-OR23-1A / CBS 708.
sp|O74631|FD123_TRAVE            88.2   Protein FDD123                                                          Trametes versicolor
sp|O74870|MUG73_SCHPO            73.5   Meiotically up-regulated gene 73 protein                                Schizosaccharomyces pombe (strain 972 / ATCC 24843)
sp|Q9F7P4|PRRG_PRB01             60.2   Green-light absorbing proteorhodopsin                                   Gamma-proteobacterium EBAC31A08
sp|Q6J4G7|PRRG_UNKP              60.2   Green-light absorbing proteorhodopsin                                   Unknown prokaryotic organism
sp|Q9AFF7|PRRB_PRB02             54.5   Blue-light absorbing proteorhodopsin                                    Gamma-proteobacterium Hot 75m4
sp|P38079|YRO2_YEAST             49.6   Protein YRO2                                                            Saccharomyces cerevisiae (strain ATCC 204508 / S288c)
sp|Q12117|MRH1_YEAST             41.0   Protein MRH1                                                            Saccharomyces cerevisiae (strain ATCC 204508 / S288c)
sp|P50657|COX3_ANAPL             14.8   Cytochrome c oxidase subunit 3 (Fragment)                               Anas platyrhynchos
sp|O55747|141R_IIV6              13.7   Uncharacterized protein 141R                                            Invertebrate iridescent virus 6
sp|Q89669|GLYCO_ARV              13.5   Glycoprotein                                                            Adelaide River virus
sp|Q01559|HMDH_NICSY             12.6   3-hydroxy-3-methylglutaryl-coenzyme A reductase                         Nicotiana sylvestris
sp|P32595|GLYCO_BEFVB            12.1   Glycoprotein                                                            Bovine ephemeral fever virus (strain BB7721)
sp|A1VUV1|HOA_POLNA              12.0   4-hydroxy-2-oxovalerate aldolase                                        Polaromonas naphthalenivorans (strain CJ2)
sp|P48020|HMDH1_SOLTU            12.0   3-hydroxy-3-methylglutaryl-coenzyme A reductase 1                       Solanum tuberosum
sp|Q8S2G5|GPDH2_ORYSJ            11.7   Probable glycerol-3-phosphate dehydrogenase [NAD(+)] 2, cytosolic       Oryza sativa subsp. japonica
sp|U5JCC6|HMR2B_PANGI            11.7   3-hydroxy-3-methylglutaryl coenzyme A reductase 2-B                     Panax ginseng
sp|P48022|HMDH2_SOLLC            11.7   3-hydroxy-3-methylglutaryl-coenzyme A reductase 2                       Solanum lycopersicum
sp|Q41437|HMDH2_SOLTU            11.6   3-hydroxy-3-methylglutaryl-coenzyme A reductase 2                       Solanum tuberosum
sp|A0A0A1C930|HMR2A_PANGI        11.5   3-hydroxy-3-methylglutaryl coenzyme A reductase 2-A                     Panax ginseng
sp|Q9XEL8|HMDH2_CAPAN            11.4   3-hydroxy-3-methylglutaryl-coenzyme A reductase 2                       Capsicum annuum
sp|Q8S0G4|GPDH1_ORYSJ            11.4   Probable glycerol-3-phosphate dehydrogenase [NAD(+)] 1, cytosolic       Oryza sativa subsp. japonica

Notice that although it is above inclusion threshold, the opsin from Neurospora crassa stays excluded, as expected from our selection function. Regarding the included targets in the final hits, they should only contain rhodopsins from Halobacteria:

[19]:
for hit in iterations[-1].hits:
    if hit.is_included():
        print(f"{hit.name.decode():<24}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():34}\t{organism_by_name[hit.name].decode()}")
sp|B0R2U4|BACH_HALS3            392.0   Halorhodopsin                           Halobacterium salinarum (strain ATCC 29341 / DSM 671 / R1)
sp|P0DMH7|BACH_HALSA            392.0   Halorhodopsin                           Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|Q5V1N0|BACH_HALMA            388.6   Halorhodopsin                           Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|Q48314|BACH_HALHS            388.1   Halorhodopsin                           Halobacterium halobium (strain shark)
sp|Q48315|BACH_HALHP            388.0   Halorhodopsin                           Halobacterium halobium (strain port)
sp|P94853|BACH_HALVA            385.6   Cruxhalorhodopsin-3                     Haloarcula vallismortis
sp|P33742|BACH_HALSS            381.5   Halorhodopsin                           Halobacterium sp. (strain SG1)
sp|O93742|BACH_HALSD            376.1   Halorhodopsin                           Halorubrum sodomense
sp|O93741|BACH_HALS4            356.3   Halorhodopsin                           Haloterrigena sp. (strain arg-4)
sp|P15647|BACH_NATPH            352.7   Halorhodopsin                           Natronomonas pharaonis
sp|Q57101|BACR_HALAR            312.2   Cruxrhodopsin-1                         Haloarcula argentinensis
sp|Q5UXY6|BACR1_HALMA           310.4   Bacteriorhodopsin-I                     Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|P94854|BACR_HALVA            309.1   Cruxrhodopsin-3                         Haloarcula vallismortis
sp|P29563|BACR2_HALS2           303.1   Archaerhodopsin-2                       Halobacterium sp. (strain aus-2)
sp|G0LFX8|BACR1_HALWC           301.9   Bacteriorhodopsin-I                     Haloquadratum walsbyi (strain DSM 16854 / JCM 12705 / C23)
sp|Q18DH8|BACR1_HALWD           301.9   Bacteriorhodopsin-I                     Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
sp|P69051|BACR1_HALC1           301.3   Archaerhodopsin-1                       Halorubrum chaoviator (strain DSM 11365 / JCM 9573 / AUS-1)
sp|P69052|BACR1_HALSS           301.3   Archaerhodopsin-1                       Halobacterium sp. (strain SG1)
sp|Q53496|BACR_HALSR            299.7   Cruxrhodopsin-2                         Haloarcula sp. (strain arg-2 / Andes heights)
sp|P96787|BACR3_HALSD           299.0   Archaerhodopsin-3                       Halorubrum sodomense
sp|Q5V0R5|BACR2_HALMA           293.1   Bacteriorhodopsin-II                    Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|P02945|BACR_HALSA            288.0   Bacteriorhodopsin                       Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|P33970|BACH_HALHM            283.2   Halorhodopsin (Fragment)                Halobacterium halobium (strain mex)
sp|O93740|BACR_HALS4            282.3   Bacteriorhodopsin                       Haloterrigena sp. (strain arg-4)
sp|P33971|BACR_HALHP            278.8   Bacteriorhodopsin (Fragment)            Halobacterium halobium (strain port)
sp|Q18DH5|BACRM_HALWD           277.0   Bacteriorhodopsin-II-like protein       Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
sp|P33972|BACR_HALHS            275.9   Bacteriorhodopsin (Fragment)            Halobacterium halobium (strain shark)
sp|P33969|BACR_HALHM            262.3   Bacteriorhodopsin (Fragment)            Halobacterium halobium (strain mex)
sp|O93743|BACS_HALSD            234.6   Sensory rhodopsin                       Halorubrum sodomense
sp|P33743|BACS1_HALSS           234.4   Sensory rhodopsin-1                     Halobacterium sp. (strain SG1)
sp|P42196|BACS2_NATPH           230.7   Sensory rhodopsin-2                     Natronomonas pharaonis
sp|Q48334|BAC3_HALVA            229.2   Bacterial rhodopsin CSR3                Haloarcula vallismortis
sp|Q5UXM9|BACS1_HALMA           222.8   Sensory rhodopsin I                     Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|B0R633|BACS1_HALS3           222.5   Sensory rhodopsin-1                     Halobacterium salinarum (strain ATCC 29341 / DSM 671 / R1)
sp|P0DMH8|BACS1_HALSA           222.5   Sensory rhodopsin-1                     Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|P42197|BACS2_HALVA           221.1   Sensory rhodopsin-2                     Haloarcula vallismortis
sp|Q5V5V3|BACS2_HALMA           221.1   Sensory rhodopsin II                    Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|P71411|BACS2_HALSA           218.6   Sensory rhodopsin-2                     Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|Q5V4H7|BACS3_HALMA           205.9   Sensory rhodopsin III                   Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|Q53461|BACH_HALAR            202.7   Cruxhalorhodopsin-1 (Fragment)          Haloarcula argentinensis

Recording bitscore cutoffs

A useful feature of HMMER is the possibility to use HMM-specific bitscore cutoffs when searching sequences with a HMM, instead of relying on E-value cutoffs. Three cutoffs categories are available:

  • Noise cutoffs (NC): The least stringent of the 3 cutoffs, normally set as the highest score seen for negative sequences when gathering members of full alignments.

  • Gathering cutoffs (GA): More stringent than noise cutoffs, normally computed as the cutoffs used when gathering members of full alignments.

  • Trusted cutoffs (TC): The most stringent of the 3 cutoffs, normally set according to the lowest scores seen for true homologous sequences that were above the GA gathering thresholds, when gathering members of full alignments.

[20]:
hmm = iterations[-1].hmm

Let’s first build noise cutoffs based on the negative hits: we can use the highest scoring excluded sequence.

[21]:
from math import ceil

noise_score = max(hit.score for hit in iterations[-1].hits if not hit.is_included())
noise_domscore = max(hit.best_domain.score for hit in iterations[-1].hits if not hit.is_included())
hmm.cutoffs.noise = ceil(noise_score), ceil(noise_domscore)

For gathering, we can use the lowest scoring included sequence:

[22]:
from math import floor

gathering_score = min(hit.score for hit in iterations[-1].hits if hit.is_included())
gathering_domscore = min(hit.best_domain.score for hit in iterations[-1].hits if hit.is_included())
hmm.cutoffs.gathering = floor(gathering_score), floor(gathering_domscore)

For trusted, it’s a little bit less obvious. If you look at the bitscore distribution from the last iteration, we can see 3 distinct group of hits, one in the 390-350 range, one in the 315-260 range, and one in the 235-200 range. We may want to set the trusted cutoffs so that the only the hits in the top range can be trusted.

[23]:
from math import floor

i = max(i for i, hit in enumerate(iterations[-1].hits) if hit.score > 350)
hmm.cutoffs.trusted = floor(iterations[-1].hits[i].score), floor(iterations[-1].hits[i].best_domain.score)

Otherwise, for a more unsupervised approach, we could try to extract a cutoffs by selecting the score of the hits where the derivative is at its maximum:

[24]:
diff = [
    iterations[-1].hits[i-1].score - iterations[-1].hits[i].score
    for i in range(1, len(iterations[-1].hits))
    if iterations[-1].hits[i-1].is_included()
]
[25]:
import matplotlib.pyplot as plt

plt.plot(diff)
plt.ylabel("dscore")
plt.xlabel("Hit rank")
plt.show()
../_images/examples_iterative_search_51_0.svg
[26]:
from math import floor

argmax = max(range(len(diff)), key=diff.__getitem__)
hit = iterations[-1].hits[argmax]
hmm.cutoffs.trusted = floor(hit.score), floor(hit.best_domain.score)

Saving the HMM

Now that we are done creating the HMM, we can save it for later use:

[27]:
hmm.name = b"Rhodopsin"
hmm.description = None
hmm.accession = None
[28]:
with open("rhodopsin.hmm", "wb") as dst_file:
    hmm.write(dst_file)