This examples shows an example workflow for building an HMM from a seed sequence through an iterative search in a sequence database, similar to the JackHMMER methodology. In the second half it will also introduce extra capabilities of the Python wrapper allowing the dynamic modification of the search parameters.

The goal of this example is to build a HMM specific to halorhodopsins, a class of light-activated ion pumps found in Archaea. The difficulty of that task comes from their similarity to bacteriorhodopsins, which have a comparable conserved structure, but different activity.

Classes of rhodopsins

Figure from Zhang et al. (2011).

In a second time we will then build a broader HMM for any kind of rhodopsin, while introducing how to implement a custom hit selection function. This will allow for the automatic exclusion of false positives coming from other organisms, using an additional filter based on taxonomy.

References

Run an iterative search to build a HMM for rhodopsins#

[1]:

import pyhmmer
pyhmmer.__version__

[1]:

'0.12.1'

Downloading sequence data#

For this example workflow we will be using SwissProt, directly downloading the sequences in FASTA format from the EMBL-EBI data server. The remote file is compressed, however we can decompress the sequences on the fly using the gzip module of the Python standard library.

[2]:

import gzip
import urllib.request

alphabet = pyhmmer.easel.Alphabet.amino()

url = "https://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz"
with urllib.request.urlopen(url) as response:
    reader = gzip.open(response, "rb")
    with pyhmmer.easel.SequenceFile(reader, digital=True, alphabet=alphabet) as seq_file:
        swissprot = seq_file.read_block()

The SwissProt sequences all contain the source organism and NCBI Taxonomy accession in their FASTA header, and we will need these pieces of information later. For now, we can build an index that maps a sequence name to its source organism, and another index for taxonomy:

[3]:

import re

taxonomy_by_name = {}
organism_by_name = {}
description_by_name = {}

for seq in swissprot:
    match = re.search("(.*) OS=(.*) OX=(\d+)", seq.description)
    taxonomy_by_name[seq.name] = int(match.group(3))
    organism_by_name[seq.name] = match.group(2)
    description_by_name[seq.name] = match.group(1)

Running a simple search#

The sequence we will be using for initial similarity search is the halorhodopsin from Halobacterium salinarum ATCC 29341. First let’s extract it from SwissProt:

[4]:

hop = next(x for x in swissprot if 'B0R2U4' in x.name)

Now let’s see if we can find close homologs to this one protein without losing the specificity. For this, we first need to create a pipeline and to search for the sequence above:

[5]:

pli = pyhmmer.plan7.Pipeline(alphabet)
hits = pli.search_seq(hop, swissprot)

We can visualize the score distribution for the hits to see what we should consider as the inclusion threshold:

[6]:

import matplotlib.pyplot as plt

plt.plot([hit.score for hit in hits], "o-")
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()

From this graph we can identify three groups of protein in the hits:

The two first hits, with very high scores, that are likely very similar orthologs
A group of more remote orthologs, with a score above 300 bits
A group of very remote orthologs, under the 300 bits score mark.

If we went to run an iterative search with the default Pipeline parameters, all of these hits would be included in the next iteration, and we may lose the specifity. Let’s check the organisms included with a bitscore cutoff of 300:

[7]:

for hit in hits:
    if hit.score >= 100.0:
        print(f"{hit.name:<20}\t{hit.score:5.1f}\t{description_by_name[hit.name]:34}\t{organism_by_name[hit.name]}")

sp|B0R2U4|BACH_HALS3    563.4   Halorhodopsin                           Halobacterium salinarum (strain ATCC 29341 / DSM 671 / R1)
sp|P0DMH7|BACH_HALSA    563.4   Halorhodopsin                           Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|O93741|BACH_HALS4    377.0   Halorhodopsin                           Haloterrigena sp. (strain arg-4)
sp|P94853|BACH_HALVA    353.0   Cruxhalorhodopsin-3                     Haloarcula vallismortis
sp|P33742|BACH_HALSS    351.7   Halorhodopsin                           Halobacterium sp. (strain SG1)
sp|Q5V1N0|BACH_HALMA    348.2   Halorhodopsin                           Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|Q48315|BACH_HALHP    348.0   Halorhodopsin                           Halobacterium halobium (strain port)
sp|Q48314|BACH_HALHS    347.7   Halorhodopsin                           Halobacterium halobium (strain shark)
sp|O93742|BACH_HALSD    345.1   Halorhodopsin                           Halorubrum sodomense
sp|P15647|BACH_NATPH    286.5   Halorhodopsin                           Natronomonas pharaonis
sp|P33970|BACH_HALHM    266.8   Halorhodopsin (Fragment)                Halobacterium halobium (strain mex)
sp|Q53461|BACH_HALAR    184.0   Cruxhalorhodopsin-1 (Fragment)          Haloarcula argentinensis
sp|P69051|BACR1_HALEZ   103.3   Archaerhodopsin-1                       Halorubrum ezzemoulense
sp|P69052|BACR1_HALSS   103.3   Archaerhodopsin-1                       Halobacterium sp. (strain SG1)
sp|P02945|BACR_HALSA    103.0   Bacteriorhodopsin                       Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|P96787|BACR3_HALSD   102.0   Archaerhodopsin-3                       Halorubrum sodomense
sp|P29563|BACR2_HALS2   101.7   Archaerhodopsin-2                       Halobacterium sp. (strain aus-2)
sp|G0LFX8|BACR1_HALWC   100.2   Bacteriorhodopsin-I                     Haloquadratum walsbyi (strain DSM 16854 / JCM 12705 / C23)
sp|Q18DH8|BACR1_HALWD   100.2   Bacteriorhodopsin-I                     Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)

Looks like all of these are indeed Halobacteria, so we should be fine running an iterative search with these. However, two hits under the 300 bits are still halorhodopsins according to UniProt annotations, so to keep them included we should lower the inclusion threshold to 250 bits.

Running a conservative iterative search#

Let’s run an iterative search until it converges with a bitscore cutoff of 250, as we selected previously. For this we need to create a new pipeline, and use the Pipeline.iterate_seq method with our query and target sequences:

[8]:

pli = pyhmmer.plan7.Pipeline(alphabet, incT=250, incdomT=250)
search = pli.iterate_seq(hop, swissprot)

Now that we have an iterator, we can keep iterating until it converges (in this context, converging means that no new hits are found by the newly created HMM in comparison to the previous one):

[9]:

iterations = []
while not search.converged:
    iteration = next(search)
    iterations.append(iteration)
    print(f"Iteration={iteration.iteration} Hits={len(iteration.hits):2} Included={len(iteration.hits.included):2} Converged={search.converged}")

Iteration=1 Hits=48 Included=11 Converged=False
Iteration=2 Hits=53 Included=12 Converged=False
Iteration=3 Hits=54 Included=12 Converged=True

We can plot the score distribution for each iteration:

[10]:

import matplotlib.pyplot as plt

for iteration in iterations:
    plt.plot([hit.score for hit in iteration.hits], "o-", label=f"Iteration {iteration.iteration}")

plt.legend()
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()

With a high cutoff, the search converges quite quickly, but adds 1 new sequence to the ones that were included in the simple search. Let’s see which sequences have been included:

[11]:

for hit in iterations[-1].hits:
    if hit.included:
        print(f"{hit.name:<20}\t{hit.score:5.1f}\t{description_by_name[hit.name]:34}\t{organism_by_name[hit.name][:60]}")

sp|Q5V1N0|BACH_HALMA    505.2   Halorhodopsin                           Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
sp|Q48314|BACH_HALHS    504.8   Halorhodopsin                           Halobacterium halobium (strain shark)
sp|Q48315|BACH_HALHP    504.7   Halorhodopsin                           Halobacterium halobium (strain port)
sp|P94853|BACH_HALVA    499.5   Cruxhalorhodopsin-3                     Haloarcula vallismortis
sp|B0R2U4|BACH_HALS3    499.5   Halorhodopsin                           Halobacterium salinarum (strain ATCC 29341 / DSM 671 / R1)
sp|P0DMH7|BACH_HALSA    499.5   Halorhodopsin                           Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NR
sp|P33742|BACH_HALSS    497.6   Halorhodopsin                           Halobacterium sp. (strain SG1)
sp|O93742|BACH_HALSD    494.3   Halorhodopsin                           Halorubrum sodomense
sp|O93741|BACH_HALS4    462.7   Halorhodopsin                           Haloterrigena sp. (strain arg-4)
sp|P15647|BACH_NATPH    450.6   Halorhodopsin                           Natronomonas pharaonis
sp|P33970|BACH_HALHM    384.1   Halorhodopsin (Fragment)                Halobacterium halobium (strain mex)
sp|Q53461|BACH_HALAR    261.1   Cruxhalorhodopsin-1 (Fragment)          Haloarcula argentinensis

The last included hit, the Cruxhalorhodopsin-1 fragment, was definitely under inclusion threshold in the simple search. However, the subsequent iterations helped train a new HMM that was able to rescue it, so that it’s now included in the alignment. Let’s have a look at the excluded hits, to make sure we are not missing anything:

[12]:

for hit in iterations[-1].hits:
    if not hit.included and hit.score > 50.0:
        print(f"{hit.name:<20}\t{hit.score:5.1f}\t{description_by_name[hit.name]:34}\t{organism_by_name[hit.name][:60]}")

sp|P02945|BACR_HALSA    120.9   Bacteriorhodopsin                       Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NR
sp|Q53496|BACR_HALSR    118.4   Cruxrhodopsin-2                         Haloarcula sp. (strain arg-2 / Andes heights)
sp|Q5UXY6|BACR1_HALMA   116.3   Bacteriorhodopsin-I                     Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
sp|P29563|BACR2_HALS2   115.6   Archaerhodopsin-2                       Halobacterium sp. (strain aus-2)
sp|Q57101|BACR_HALAR    113.6   Cruxrhodopsin-1                         Haloarcula argentinensis
sp|P69051|BACR1_HALEZ   112.8   Archaerhodopsin-1                       Halorubrum ezzemoulense
sp|P69052|BACR1_HALSS   112.8   Archaerhodopsin-1                       Halobacterium sp. (strain SG1)
sp|P94854|BACR_HALVA    112.4   Cruxrhodopsin-3                         Haloarcula vallismortis
sp|G0LFX8|BACR1_HALWC   111.6   Bacteriorhodopsin-I                     Haloquadratum walsbyi (strain DSM 16854 / JCM 12705 / C23)
sp|Q18DH8|BACR1_HALWD   111.6   Bacteriorhodopsin-I                     Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
sp|P96787|BACR3_HALSD   110.2   Archaerhodopsin-3                       Halorubrum sodomense
sp|Q18DH5|BACRM_HALWD   109.6   Bacteriorhodopsin-II-like protein       Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
sp|Q5V0R5|BACR2_HALMA   106.0   Bacteriorhodopsin-II                    Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
sp|Q9UW81|NOP1_NEUCR    104.5   Opsin-1                                 Neurospora crassa (strain ATCC 24698 / 74-OR23-1A / CBS 708.
sp|P33972|BACR_HALHS    102.6   Bacteriorhodopsin (Fragment)            Halobacterium halobium (strain shark)
sp|O93740|BACR_HALS4     99.8   Bacteriorhodopsin                       Haloterrigena sp. (strain arg-4)
sp|P33971|BACR_HALHP     99.5   Bacteriorhodopsin (Fragment)            Halobacterium halobium (strain port)
sp|P33969|BACR_HALHM     96.1   Bacteriorhodopsin (Fragment)            Halobacterium halobium (strain mex)
sp|Q70LL3|CARO_FUSFU     76.1   Opsin-like protein carO                 Fusarium fujikuroi
sp|S0ELR6|CARO_GIBF5     76.1   Opsin-like protein carO                 Gibberella fujikuroi (strain CBS 195.34 / IMI 58289 / NRRL A
sp|P42196|BACS2_NATPH    66.5   Sensory rhodopsin-2                     Natronomonas pharaonis
sp|Q48334|BAC3_HALVA     64.4   Bacterial rhodopsin CSR3                Haloarcula vallismortis
sp|Q5V4H7|BACS3_HALMA    63.7   Sensory rhodopsin III                   Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
sp|Q5UXM9|BACS1_HALMA    62.7   Sensory rhodopsin I                     Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
sp|P71411|BACS2_HALSA    59.1   Sensory rhodopsin-2                     Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NR
sp|O93743|BACS_HALSD     52.5   Sensory rhodopsin                       Halorubrum sodomense
sp|O74631|FD123_TRAVE    52.0   Protein FDD123                          Trametes versicolor
sp|P42197|BACS2_HALVA    51.1   Sensory rhodopsin-2                     Haloarcula vallismortis
sp|Q5V5V3|BACS2_HALMA    51.1   Sensory rhodopsin II                    Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8

It looks like we are not missing any halorhodopsin in our hits, so looks like we managed to build a sensitive HMM.

Running an iterative search with custom selection#

The previous iterative search helped us create a HMM only covering halorhodopsins. However, let’s try to build a very broad HMM for rhodopsin search in Halobacteria. For this round, we do not care about functional specificity, but we care about taxonomy. In particular, looking at the hits from the last iteration, there are some fungal protein among the archeal ones (the opsin from Neurospora crassa, for instance). To build a HMM specific to archeal proteins we cannot rely on a simple score cutoff, so let’s instead build a dedicated function to select hits between each iteration based on taxonomy.

For handling the taxonomy we will be using taxopy, a Python library for reading NCBI Taxonomy files. Let’s start by loading the files so that we have a taxonomy database at hand:

[13]:

import taxopy
taxdb = taxopy.TaxDb(keep_files=False)

Now let’s write a selection function. We still rely on the score inclusion criterion, but we also will want to exclude any hit from an organism that is not in the Halobacteria class.

[14]:

def select_halobacteria_hits(top_hits):
    for hit in top_hits:
        taxid = taxonomy_by_name[hit.name]
        taxon = taxopy.Taxon(taxid, taxdb)
        if hit.included and taxon.rank_name_dictionary.get('class') != "Halobacteria":
            hit.dropped = True

Now run the iterative search with a lower inclusion threshold, and the selection function we just made:

[15]:

pli = pyhmmer.plan7.Pipeline(alphabet, incT=100, incdomT=100)
search = pli.iterate_seq(hop, swissprot, select_hits=select_halobacteria_hits)

[16]:

iterations = []
while not search.converged:
    iteration = next(search)
    iterations.append(iteration)
    worst_score = min(hit.score for hit in iteration.hits if hit.included)
    print(f"Iteration={iteration.iteration} Hits={len(iteration.hits):3} Included={len(iteration.hits.included):3} Converged={search.converged}")

Iteration=1 Hits= 48 Included= 19 Converged=False
Iteration=2 Hits= 50 Included= 32 Converged=False
Iteration=3 Hits=103 Included= 40 Converged=False
Iteration=4 Hits= 64 Included= 40 Converged=True

The search converges in 4 iterations. Let’s have a look at the score distribution:

[17]:

import matplotlib.pyplot as plt

for iteration in iterations:
    plt.plot([hit.score for hit in iteration.hits], "o-", label=f"Iteration {iteration.iteration}")

plt.legend()
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()

We can make sure that we are not excluding any rhodopsin from Halobacteria by having a look at the excluded hits from the last iteration:

[18]:

for hit in iterations[-1].hits:
    if not hit.included:
        print(f"{hit.name:<24}\t{hit.score:5.1f}\t{description_by_name[hit.name]:64}\t{organism_by_name[hit.name][:60]}")

sp|Q9UW81|NOP1_NEUCR            163.3   Opsin-1                                                                 Neurospora crassa (strain ATCC 24698 / 74-OR23-1A / CBS 708.
sp|Q70LL3|CARO_FUSFU            162.2   Opsin-like protein carO                                                 Fusarium fujikuroi
sp|S0ELR6|CARO_GIBF5            162.2   Opsin-like protein carO                                                 Gibberella fujikuroi (strain CBS 195.34 / IMI 58289 / NRRL A
sp|O74631|FD123_TRAVE            88.2   Protein FDD123                                                          Trametes versicolor
sp|O74870|MUG73_SCHPO            73.5   Meiotically up-regulated gene 73 protein                                Schizosaccharomyces pombe (strain 972 / ATCC 24843)
sp|Q9F7P4|PRRG_PRB01             60.2   Green-light absorbing proteorhodopsin                                   Gamma-proteobacterium EBAC31A08
sp|Q6J4G7|PRRG_UNKP              60.2   Green-light absorbing proteorhodopsin                                   Unknown prokaryotic organism
sp|Q9AFF7|PRRB_PRB02             54.5   Blue-light absorbing proteorhodopsin                                    Gamma-proteobacterium Hot 75m4
sp|P38079|YRO2_YEAST             49.6   Protein YRO2                                                            Saccharomyces cerevisiae (strain ATCC 204508 / S288c)
sp|Q12117|MRH1_YEAST             41.0   Protein MRH1                                                            Saccharomyces cerevisiae (strain ATCC 204508 / S288c)
sp|P50657|COX3_ANAPL             14.8   Cytochrome c oxidase subunit 3 (Fragment)                               Anas platyrhynchos
sp|O55747|141R_IIV6              13.7   Uncharacterized protein 141R                                            Invertebrate iridescent virus 6
sp|Q89669|GLYCO_ARV              13.5   Glycoprotein                                                            Adelaide River virus
sp|Q01559|HMDH_NICSY             12.6   3-hydroxy-3-methylglutaryl-coenzyme A reductase                         Nicotiana sylvestris
sp|P32595|GLYCO_BEFVB            12.1   Glycoprotein                                                            Bovine ephemeral fever virus (strain BB7721)
sp|A1VUV1|HOA_POLNA              12.0   4-hydroxy-2-oxovalerate aldolase                                        Polaromonas naphthalenivorans (strain CJ2)
sp|P48020|HMDH1_SOLTU            12.0   3-hydroxy-3-methylglutaryl-coenzyme A reductase 1                       Solanum tuberosum
sp|Q8S2G5|GPDH2_ORYSJ            11.7   Probable glycerol-3-phosphate dehydrogenase [NAD(+)] 2, cytosolic       Oryza sativa subsp. japonica
sp|U5JCC6|HMR2B_PANGI            11.7   3-hydroxy-3-methylglutaryl coenzyme A reductase 2-B                     Panax ginseng
sp|P48022|HMDH2_SOLLC            11.7   3-hydroxy-3-methylglutaryl-coenzyme A reductase 2                       Solanum lycopersicum
sp|Q41437|HMDH2_SOLTU            11.6   3-hydroxy-3-methylglutaryl-coenzyme A reductase 2                       Solanum tuberosum
sp|A0A0A1C930|HMR2A_PANGI        11.5   3-hydroxy-3-methylglutaryl coenzyme A reductase 2-A                     Panax ginseng
sp|Q9XEL8|HMDH2_CAPAN            11.4   3-hydroxy-3-methylglutaryl-coenzyme A reductase 2                       Capsicum annuum
sp|Q8S0G4|GPDH1_ORYSJ            11.4   Probable glycerol-3-phosphate dehydrogenase [NAD(+)] 1, cytosolic       Oryza sativa subsp. japonica

Notice that although it is above inclusion threshold, the opsin from Neurospora crassa stays excluded, as expected from our selection function. Regarding the included targets in the final hits, they should only contain rhodopsins from Halobacteria:

[19]:

for hit in iterations[-1].hits:
    if hit.included:
        print(f"{hit.name:<24}\t{hit.score:5.1f}\t{description_by_name[hit.name]:34}\t{organism_by_name[hit.name]}")

sp|B0R2U4|BACH_HALS3            392.0   Halorhodopsin                           Halobacterium salinarum (strain ATCC 29341 / DSM 671 / R1)
sp|P0DMH7|BACH_HALSA            392.0   Halorhodopsin                           Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|Q5V1N0|BACH_HALMA            388.6   Halorhodopsin                           Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|Q48314|BACH_HALHS            388.1   Halorhodopsin                           Halobacterium halobium (strain shark)
sp|Q48315|BACH_HALHP            388.0   Halorhodopsin                           Halobacterium halobium (strain port)
sp|P94853|BACH_HALVA            385.6   Cruxhalorhodopsin-3                     Haloarcula vallismortis
sp|P33742|BACH_HALSS            381.5   Halorhodopsin                           Halobacterium sp. (strain SG1)
sp|O93742|BACH_HALSD            376.1   Halorhodopsin                           Halorubrum sodomense
sp|O93741|BACH_HALS4            356.3   Halorhodopsin                           Haloterrigena sp. (strain arg-4)
sp|P15647|BACH_NATPH            352.7   Halorhodopsin                           Natronomonas pharaonis
sp|Q57101|BACR_HALAR            312.2   Cruxrhodopsin-1                         Haloarcula argentinensis
sp|Q5UXY6|BACR1_HALMA           310.4   Bacteriorhodopsin-I                     Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|P94854|BACR_HALVA            309.1   Cruxrhodopsin-3                         Haloarcula vallismortis
sp|P29563|BACR2_HALS2           303.1   Archaerhodopsin-2                       Halobacterium sp. (strain aus-2)
sp|G0LFX8|BACR1_HALWC           301.9   Bacteriorhodopsin-I                     Haloquadratum walsbyi (strain DSM 16854 / JCM 12705 / C23)
sp|Q18DH8|BACR1_HALWD           301.9   Bacteriorhodopsin-I                     Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
sp|P69051|BACR1_HALEZ           301.3   Archaerhodopsin-1                       Halorubrum ezzemoulense
sp|P69052|BACR1_HALSS           301.3   Archaerhodopsin-1                       Halobacterium sp. (strain SG1)
sp|Q53496|BACR_HALSR            299.7   Cruxrhodopsin-2                         Haloarcula sp. (strain arg-2 / Andes heights)
sp|P96787|BACR3_HALSD           299.0   Archaerhodopsin-3                       Halorubrum sodomense
sp|Q5V0R5|BACR2_HALMA           293.1   Bacteriorhodopsin-II                    Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|P02945|BACR_HALSA            288.0   Bacteriorhodopsin                       Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|P33970|BACH_HALHM            283.2   Halorhodopsin (Fragment)                Halobacterium halobium (strain mex)
sp|O93740|BACR_HALS4            282.3   Bacteriorhodopsin                       Haloterrigena sp. (strain arg-4)
sp|P33971|BACR_HALHP            278.8   Bacteriorhodopsin (Fragment)            Halobacterium halobium (strain port)
sp|Q18DH5|BACRM_HALWD           277.0   Bacteriorhodopsin-II-like protein       Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
sp|P33972|BACR_HALHS            275.9   Bacteriorhodopsin (Fragment)            Halobacterium halobium (strain shark)
sp|P33969|BACR_HALHM            262.3   Bacteriorhodopsin (Fragment)            Halobacterium halobium (strain mex)
sp|O93743|BACS_HALSD            234.6   Sensory rhodopsin                       Halorubrum sodomense
sp|P33743|BACS1_HALSS           234.4   Sensory rhodopsin-1                     Halobacterium sp. (strain SG1)
sp|P42196|BACS2_NATPH           230.7   Sensory rhodopsin-2                     Natronomonas pharaonis
sp|Q48334|BAC3_HALVA            229.2   Bacterial rhodopsin CSR3                Haloarcula vallismortis
sp|Q5UXM9|BACS1_HALMA           222.8   Sensory rhodopsin I                     Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|B0R633|BACS1_HALS3           222.5   Sensory rhodopsin-1                     Halobacterium salinarum (strain ATCC 29341 / DSM 671 / R1)
sp|P0DMH8|BACS1_HALSA           222.5   Sensory rhodopsin-1                     Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|P42197|BACS2_HALVA           221.1   Sensory rhodopsin-2                     Haloarcula vallismortis
sp|Q5V5V3|BACS2_HALMA           221.1   Sensory rhodopsin II                    Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|P71411|BACS2_HALSA           218.6   Sensory rhodopsin-2                     Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|Q5V4H7|BACS3_HALMA           205.9   Sensory rhodopsin III                   Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|Q53461|BACH_HALAR            202.7   Cruxhalorhodopsin-1 (Fragment)          Haloarcula argentinensis

Recording bitscore cutoffs#

A useful feature of HMMER is the possibility to use HMM-specific bitscore cutoffs when searching sequences with a HMM, instead of relying on E-value cutoffs. Three cutoffs categories are available:

Noise cutoffs (NC): The least stringent of the 3 cutoffs, normally set as the highest score seen for negative sequences when gathering members of full alignments.
Gathering cutoffs (GA): More stringent than noise cutoffs, normally computed as the cutoffs used when gathering members of full alignments.
Trusted cutoffs (TC): The most stringent of the 3 cutoffs, normally set according to the lowest scores seen for true homologous sequences that were above the GA gathering thresholds, when gathering members of full alignments.

[20]:

hmm = iterations[-1].hmm

Let’s first build noise cutoffs based on the negative hits: we can use the highest scoring excluded sequence.

[21]:

from math import ceil

noise_score = max(hit.score for hit in iterations[-1].hits if not hit.included)
noise_domscore = max(hit.best_domain.score for hit in iterations[-1].hits if not hit.included)
hmm.cutoffs.noise = ceil(noise_score), ceil(noise_domscore)

For gathering, we can use the lowest scoring included sequence:

[22]:

from math import floor

gathering_score = min(hit.score for hit in iterations[-1].hits if hit.included)
gathering_domscore = min(hit.best_domain.score for hit in iterations[-1].hits if hit.included)
hmm.cutoffs.gathering = floor(gathering_score), floor(gathering_domscore)

For trusted, it’s a little bit less obvious. If you look at the bitscore distribution from the last iteration, we can see 3 distinct group of hits, one in the 390-350 range, one in the 315-260 range, and one in the 235-200 range. We may want to set the trusted cutoffs so that the only the hits in the top range can be trusted.

[23]:

from math import floor

i = max(i for i, hit in enumerate(iterations[-1].hits) if hit.score > 350)
hmm.cutoffs.trusted = floor(iterations[-1].hits[i].score), floor(iterations[-1].hits[i].best_domain.score)

Otherwise, for a more unsupervised approach, we could try to extract a cutoffs by selecting the score of the hits where the derivative is at its maximum:

[24]:

diff = [
    iterations[-1].hits[i-1].score - iterations[-1].hits[i].score
    for i in range(1, len(iterations[-1].hits))
    if iterations[-1].hits[i-1].included
]

[25]:

import matplotlib.pyplot as plt

plt.plot(diff)
plt.ylabel("dscore")
plt.xlabel("Hit rank")
plt.show()

[26]:

from math import floor

argmax = max(range(len(diff)), key=diff.__getitem__)
hit = iterations[-1].hits[argmax]
hmm.cutoffs.trusted = floor(hit.score), floor(hit.best_domain.score)

Saving the HMM#

Now that we are done creating the HMM, we can save it for later use:

[27]:

hmm.name = b"Rhodopsin"
hmm.description = None
hmm.accession = None

[28]:

with open("rhodopsin.hmm", "wb") as dst_file:
    hmm.write(dst_file)