Run an iterative search to build a HMM for rhodopsins¶
This examples shows an example workflow for building an HMM from a seed sequence through an iterative search in a sequence database, similar to the JackHMMER methodology. In the second half it will also introduce extra capabilities of the Python wrapper allowing the dynamic modification of the search parameters.
The goal of this example is to build a HMM specific to halorhodopsins, a class of light-activated ion pumps found in Archaea. The difficulty of that task comes from their similarity to bacteriorhodopsins, which have a comparable conserved structure, but different activity.
Figure from Zhang et al. (2011).
In a second time we will then build a broader HMM for any kind of rhodopsin, while introducing how to implement a custom hit selection function. This will allow for the automatic exclusion of false positives coming from other organisms, using an additional filter based on taxonomy.
References
[1]:
import pyhmmer
pyhmmer.__version__
[1]:
'0.10.4'
Downloading sequence data¶
For this example workflow we will be using SwissProt, directly downloading the sequences in FASTA format from the EMBL-EBI data server. The remote file is compressed, however we can decompress the sequences on the fly using the gzip
module of the Python standard library.
[2]:
import gzip
import urllib.request
alphabet = pyhmmer.easel.Alphabet.amino()
url = "http://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.fasta.gz"
with urllib.request.urlopen(url) as response:
reader = gzip.open(response, "rb")
with pyhmmer.easel.SequenceFile(reader, digital=True, alphabet=alphabet) as seq_file:
swissprot = seq_file.read_block()
The SwissProt sequences all contain the source organism and NCBI Taxonomy accession in their FASTA header, and we will need these pieces of information later. For now, we can build an index that maps a sequence name to its source organism, and another index for taxonomy:
[3]:
import re
taxonomy_by_name = {}
organism_by_name = {}
description_by_name = {}
for seq in swissprot:
match = re.search(b"(.*) OS=(.*) OX=(\d+)", seq.description)
taxonomy_by_name[seq.name] = int(match.group(3))
organism_by_name[seq.name] = match.group(2)
description_by_name[seq.name] = match.group(1)
Running a simple search¶
The sequence we will be using for initial similarity search is the halorhodopsin from Halobacterium salinarum ATCC 29341. First let’s extract it from SwissProt:
[4]:
hop = next(x for x in swissprot if b'B0R2U4' in x.name)
Now let’s see if we can find close homologs to this one protein without losing the specificity. For this, we first need to create a pipeline and to search for the sequence above:
[5]:
pli = pyhmmer.plan7.Pipeline(alphabet)
hits = pli.search_seq(hop, swissprot)
We can visualize the score distribution for the hits to see what we should consider as the inclusion threshold:
[6]:
import matplotlib.pyplot as plt
plt.plot([hit.score for hit in hits], "o-")
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()
From this graph we can identify three groups of protein in the hits:
The two first hits, with very high scores, that are likely very similar orthologs
A group of more remote orthologs, with a score above 300 bits
A group of very remote orthologs, under the 300 bits score mark.
If we went to run an iterative search with the default Pipeline
parameters, all of these hits would be included in the next iteration, and we may lose the specifity. Let’s check the organisms included with a bitscore cutoff of 300:
[7]:
for hit in hits:
if hit.score >= 100.0:
print(f"{hit.name.decode():<20}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():34}\t{organism_by_name[hit.name].decode()}")
sp|B0R2U4|BACH_HALS3 563.4 Halorhodopsin Halobacterium salinarum (strain ATCC 29341 / DSM 671 / R1)
sp|P0DMH7|BACH_HALSA 563.4 Halorhodopsin Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|O93741|BACH_HALS4 377.0 Halorhodopsin Haloterrigena sp. (strain arg-4)
sp|P94853|BACH_HALVA 353.0 Cruxhalorhodopsin-3 Haloarcula vallismortis
sp|P33742|BACH_HALSS 351.7 Halorhodopsin Halobacterium sp. (strain SG1)
sp|Q5V1N0|BACH_HALMA 348.2 Halorhodopsin Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|Q48315|BACH_HALHP 348.0 Halorhodopsin Halobacterium halobium (strain port)
sp|Q48314|BACH_HALHS 347.7 Halorhodopsin Halobacterium halobium (strain shark)
sp|O93742|BACH_HALSD 345.1 Halorhodopsin Halorubrum sodomense
sp|P15647|BACH_NATPH 286.5 Halorhodopsin Natronomonas pharaonis
sp|P33970|BACH_HALHM 266.8 Halorhodopsin (Fragment) Halobacterium halobium (strain mex)
sp|Q53461|BACH_HALAR 184.0 Cruxhalorhodopsin-1 (Fragment) Haloarcula argentinensis
sp|P69051|BACR1_HALEZ 103.3 Archaerhodopsin-1 Halorubrum ezzemoulense
sp|P69052|BACR1_HALSS 103.3 Archaerhodopsin-1 Halobacterium sp. (strain SG1)
sp|P02945|BACR_HALSA 103.0 Bacteriorhodopsin Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|P96787|BACR3_HALSD 102.0 Archaerhodopsin-3 Halorubrum sodomense
sp|P29563|BACR2_HALS2 101.7 Archaerhodopsin-2 Halobacterium sp. (strain aus-2)
sp|G0LFX8|BACR1_HALWC 100.2 Bacteriorhodopsin-I Haloquadratum walsbyi (strain DSM 16854 / JCM 12705 / C23)
sp|Q18DH8|BACR1_HALWD 100.2 Bacteriorhodopsin-I Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
Looks like all of these are indeed Halobacteria, so we should be fine running an iterative search with these. However, two hits under the 300 bits are still halorhodopsins according to UniProt annotations, so to keep them included we should lower the inclusion threshold to 250 bits.
Running a conservative iterative search¶
Let’s run an iterative search until it converges with a bitscore cutoff of 250, as we selected previously. For this we need to create a new pipeline, and use the Pipeline.iterate_seq
method with our query and target sequences:
[8]:
pli = pyhmmer.plan7.Pipeline(alphabet, incT=250, incdomT=250)
search = pli.iterate_seq(hop, swissprot)
Now that we have an iterator, we can keep iterating until it converges (in this context, converging means that no new hits are found by the newly created HMM in comparison to the previous one):
[9]:
iterations = []
while not search.converged:
iteration = next(search)
iterations.append(iteration)
print(f"Iteration={iteration.iteration} Hits={len(iteration.hits):2} Included={len(iteration.hits.included):2} Converged={search.converged}")
Iteration=1 Hits=48 Included=11 Converged=False
Iteration=2 Hits=53 Included=12 Converged=False
Iteration=3 Hits=53 Included=12 Converged=True
We can plot the score distribution for each iteration:
[10]:
import matplotlib.pyplot as plt
for iteration in iterations:
plt.plot([hit.score for hit in iteration.hits], "o-", label=f"Iteration {iteration.iteration}")
plt.legend()
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()
With a high cutoff, the search converges quite quickly, but adds 1 new sequence to the ones that were included in the simple search. Let’s see which sequences have been included:
[11]:
for hit in iterations[-1].hits:
if hit.included:
print(f"{hit.name.decode():<20}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():34}\t{organism_by_name[hit.name][:60].decode()}")
sp|Q5V1N0|BACH_HALMA 505.2 Halorhodopsin Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
sp|Q48314|BACH_HALHS 504.8 Halorhodopsin Halobacterium halobium (strain shark)
sp|Q48315|BACH_HALHP 504.7 Halorhodopsin Halobacterium halobium (strain port)
sp|P94853|BACH_HALVA 499.5 Cruxhalorhodopsin-3 Haloarcula vallismortis
sp|B0R2U4|BACH_HALS3 499.5 Halorhodopsin Halobacterium salinarum (strain ATCC 29341 / DSM 671 / R1)
sp|P0DMH7|BACH_HALSA 499.5 Halorhodopsin Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NR
sp|P33742|BACH_HALSS 497.6 Halorhodopsin Halobacterium sp. (strain SG1)
sp|O93742|BACH_HALSD 494.3 Halorhodopsin Halorubrum sodomense
sp|O93741|BACH_HALS4 462.7 Halorhodopsin Haloterrigena sp. (strain arg-4)
sp|P15647|BACH_NATPH 450.6 Halorhodopsin Natronomonas pharaonis
sp|P33970|BACH_HALHM 384.1 Halorhodopsin (Fragment) Halobacterium halobium (strain mex)
sp|Q53461|BACH_HALAR 261.1 Cruxhalorhodopsin-1 (Fragment) Haloarcula argentinensis
The last included hit, the Cruxhalorhodopsin-1 fragment, was definitely under inclusion threshold in the simple search. However, the subsequent iterations helped train a new HMM that was able to rescue it, so that it’s now included in the alignment. Let’s have a look at the excluded hits, to make sure we are not missing anything:
[12]:
for hit in iterations[-1].hits:
if not hit.included and hit.score > 50.0:
print(f"{hit.name.decode():<20}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():34}\t{organism_by_name[hit.name][:60].decode()}")
sp|P02945|BACR_HALSA 120.9 Bacteriorhodopsin Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NR
sp|Q53496|BACR_HALSR 118.4 Cruxrhodopsin-2 Haloarcula sp. (strain arg-2 / Andes heights)
sp|Q5UXY6|BACR1_HALMA 116.3 Bacteriorhodopsin-I Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
sp|P29563|BACR2_HALS2 115.6 Archaerhodopsin-2 Halobacterium sp. (strain aus-2)
sp|Q57101|BACR_HALAR 113.6 Cruxrhodopsin-1 Haloarcula argentinensis
sp|P69051|BACR1_HALEZ 112.8 Archaerhodopsin-1 Halorubrum ezzemoulense
sp|P69052|BACR1_HALSS 112.8 Archaerhodopsin-1 Halobacterium sp. (strain SG1)
sp|P94854|BACR_HALVA 112.4 Cruxrhodopsin-3 Haloarcula vallismortis
sp|G0LFX8|BACR1_HALWC 111.6 Bacteriorhodopsin-I Haloquadratum walsbyi (strain DSM 16854 / JCM 12705 / C23)
sp|Q18DH8|BACR1_HALWD 111.6 Bacteriorhodopsin-I Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
sp|P96787|BACR3_HALSD 110.2 Archaerhodopsin-3 Halorubrum sodomense
sp|Q18DH5|BACRM_HALWD 109.6 Bacteriorhodopsin-II-like protein Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
sp|Q5V0R5|BACR2_HALMA 106.0 Bacteriorhodopsin-II Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
sp|Q9UW81|NOP1_NEUCR 104.5 Opsin-1 Neurospora crassa (strain ATCC 24698 / 74-OR23-1A / CBS 708.
sp|P33972|BACR_HALHS 102.6 Bacteriorhodopsin (Fragment) Halobacterium halobium (strain shark)
sp|O93740|BACR_HALS4 99.8 Bacteriorhodopsin Haloterrigena sp. (strain arg-4)
sp|P33971|BACR_HALHP 99.5 Bacteriorhodopsin (Fragment) Halobacterium halobium (strain port)
sp|P33969|BACR_HALHM 96.1 Bacteriorhodopsin (Fragment) Halobacterium halobium (strain mex)
sp|Q70LL3|CARO_GIBFU 76.1 Opsin-like protein carO Gibberella fujikuroi
sp|S0ELR6|CARO_GIBF5 76.1 Opsin-like protein carO Gibberella fujikuroi (strain CBS 195.34 / IMI 58289 / NRRL A
sp|P42196|BACS2_NATPH 66.5 Sensory rhodopsin-2 Natronomonas pharaonis
sp|Q48334|BAC3_HALVA 64.4 Bacterial rhodopsin CSR3 Haloarcula vallismortis
sp|Q5V4H7|BACS3_HALMA 63.7 Sensory rhodopsin III Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
sp|Q5UXM9|BACS1_HALMA 62.7 Sensory rhodopsin I Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
sp|P71411|BACS2_HALSA 59.1 Sensory rhodopsin-2 Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NR
sp|O93743|BACS_HALSD 52.5 Sensory rhodopsin Halorubrum sodomense
sp|O74631|FD123_TRAVE 52.0 Protein FDD123 Trametes versicolor
sp|P42197|BACS2_HALVA 51.1 Sensory rhodopsin-2 Haloarcula vallismortis
sp|Q5V5V3|BACS2_HALMA 51.1 Sensory rhodopsin II Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8
It looks like we are not missing any halorhodopsin in our hits, so looks like we managed to build a sensitive HMM.
Running an iterative search with custom selection¶
The previous iterative search helped us create a HMM only covering halorhodopsins. However, let’s try to build a very broad HMM for rhodopsin search in Halobacteria. For this round, we do not care about functional specificity, but we care about taxonomy. In particular, looking at the hits from the last iteration, there are some fungal protein among the archeal ones (the opsin from Neurospora crassa, for instance). To build a HMM specific to archeal proteins we cannot rely on a simple score cutoff, so let’s instead build a dedicated function to select hits between each iteration based on taxonomy.
For handling the taxonomy we will be using taxopy, a Python library for reading NCBI Taxonomy files. Let’s start by loading the files so that we have a taxonomy database at hand:
[13]:
import taxopy
taxdb = taxopy.TaxDb(keep_files=False)
Now let’s write a selection function. We still rely on the score inclusion criterion, but we also will want to exclude any hit from an organism that is not in the Halobacteria class.
[14]:
def select_halobacteria_hits(top_hits):
for hit in top_hits:
taxid = taxonomy_by_name[hit.name]
taxon = taxopy.Taxon(taxid, taxdb)
if hit.included and taxon.rank_name_dictionary.get('class') != "Halobacteria":
hit.dropped = True
Now run the iterative search with a lower inclusion threshold, and the selection function we just made:
[15]:
pli = pyhmmer.plan7.Pipeline(alphabet, incT=100, incdomT=100)
search = pli.iterate_seq(hop, swissprot, select_hits=select_halobacteria_hits)
[16]:
iterations = []
while not search.converged:
iteration = next(search)
iterations.append(iteration)
worst_score = min(hit.score for hit in iteration.hits if hit.included)
print(f"Iteration={iteration.iteration} Hits={len(iteration.hits):3} Included={len(iteration.hits.included):3} Converged={search.converged}")
Iteration=1 Hits= 48 Included= 19 Converged=False
Iteration=2 Hits= 50 Included= 32 Converged=False
Iteration=3 Hits=103 Included= 40 Converged=False
Iteration=4 Hits= 64 Included= 40 Converged=True
The search converges in 4 iterations. Let’s have a look at the score distribution:
[17]:
import matplotlib.pyplot as plt
for iteration in iterations:
plt.plot([hit.score for hit in iteration.hits], "o-", label=f"Iteration {iteration.iteration}")
plt.legend()
plt.xlabel("Hit rank")
plt.ylabel("Score (bits)")
plt.show()
We can make sure that we are not excluding any rhodopsin from Halobacteria by having a look at the excluded hits from the last iteration:
[18]:
for hit in iterations[-1].hits:
if not hit.included:
print(f"{hit.name.decode():<24}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():64}\t{organism_by_name[hit.name][:60].decode()}")
sp|Q9UW81|NOP1_NEUCR 163.3 Opsin-1 Neurospora crassa (strain ATCC 24698 / 74-OR23-1A / CBS 708.
sp|Q70LL3|CARO_GIBFU 162.2 Opsin-like protein carO Gibberella fujikuroi
sp|S0ELR6|CARO_GIBF5 162.2 Opsin-like protein carO Gibberella fujikuroi (strain CBS 195.34 / IMI 58289 / NRRL A
sp|O74631|FD123_TRAVE 88.2 Protein FDD123 Trametes versicolor
sp|O74870|MUG73_SCHPO 73.5 Meiotically up-regulated gene 73 protein Schizosaccharomyces pombe (strain 972 / ATCC 24843)
sp|Q9F7P4|PRRG_PRB01 60.2 Green-light absorbing proteorhodopsin Gamma-proteobacterium EBAC31A08
sp|Q6J4G7|PRRG_UNKP 60.2 Green-light absorbing proteorhodopsin Unknown prokaryotic organism
sp|Q9AFF7|PRRB_PRB02 54.5 Blue-light absorbing proteorhodopsin Gamma-proteobacterium Hot 75m4
sp|P38079|YRO2_YEAST 49.6 Protein YRO2 Saccharomyces cerevisiae (strain ATCC 204508 / S288c)
sp|Q12117|MRH1_YEAST 41.0 Protein MRH1 Saccharomyces cerevisiae (strain ATCC 204508 / S288c)
sp|P50657|COX3_ANAPL 14.8 Cytochrome c oxidase subunit 3 (Fragment) Anas platyrhynchos
sp|O55747|141R_IIV6 13.7 Uncharacterized protein 141R Invertebrate iridescent virus 6
sp|Q89669|GLYCO_ARV 13.5 Glycoprotein Adelaide River virus
sp|Q01559|HMDH_NICSY 12.6 3-hydroxy-3-methylglutaryl-coenzyme A reductase Nicotiana sylvestris
sp|P32595|GLYCO_BEFVB 12.1 Glycoprotein Bovine ephemeral fever virus (strain BB7721)
sp|A1VUV1|HOA_POLNA 12.0 4-hydroxy-2-oxovalerate aldolase Polaromonas naphthalenivorans (strain CJ2)
sp|P48020|HMDH1_SOLTU 12.0 3-hydroxy-3-methylglutaryl-coenzyme A reductase 1 Solanum tuberosum
sp|Q8S2G5|GPDH2_ORYSJ 11.7 Probable glycerol-3-phosphate dehydrogenase [NAD(+)] 2, cytosolic Oryza sativa subsp. japonica
sp|U5JCC6|HMR2B_PANGI 11.7 3-hydroxy-3-methylglutaryl coenzyme A reductase 2-B Panax ginseng
sp|P48022|HMDH2_SOLLC 11.7 3-hydroxy-3-methylglutaryl-coenzyme A reductase 2 Solanum lycopersicum
sp|Q41437|HMDH2_SOLTU 11.6 3-hydroxy-3-methylglutaryl-coenzyme A reductase 2 Solanum tuberosum
sp|A0A0A1C930|HMR2A_PANGI 11.5 3-hydroxy-3-methylglutaryl coenzyme A reductase 2-A Panax ginseng
sp|Q9XEL8|HMDH2_CAPAN 11.4 3-hydroxy-3-methylglutaryl-coenzyme A reductase 2 Capsicum annuum
sp|Q8S0G4|GPDH1_ORYSJ 11.4 Probable glycerol-3-phosphate dehydrogenase [NAD(+)] 1, cytosolic Oryza sativa subsp. japonica
Notice that although it is above inclusion threshold, the opsin from Neurospora crassa stays excluded, as expected from our selection function. Regarding the included targets in the final hits, they should only contain rhodopsins from Halobacteria:
[19]:
for hit in iterations[-1].hits:
if hit.included:
print(f"{hit.name.decode():<24}\t{hit.score:5.1f}\t{description_by_name[hit.name].decode():34}\t{organism_by_name[hit.name].decode()}")
sp|B0R2U4|BACH_HALS3 392.0 Halorhodopsin Halobacterium salinarum (strain ATCC 29341 / DSM 671 / R1)
sp|P0DMH7|BACH_HALSA 392.0 Halorhodopsin Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|Q5V1N0|BACH_HALMA 388.6 Halorhodopsin Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|Q48314|BACH_HALHS 388.1 Halorhodopsin Halobacterium halobium (strain shark)
sp|Q48315|BACH_HALHP 388.0 Halorhodopsin Halobacterium halobium (strain port)
sp|P94853|BACH_HALVA 385.6 Cruxhalorhodopsin-3 Haloarcula vallismortis
sp|P33742|BACH_HALSS 381.5 Halorhodopsin Halobacterium sp. (strain SG1)
sp|O93742|BACH_HALSD 376.1 Halorhodopsin Halorubrum sodomense
sp|O93741|BACH_HALS4 356.3 Halorhodopsin Haloterrigena sp. (strain arg-4)
sp|P15647|BACH_NATPH 352.7 Halorhodopsin Natronomonas pharaonis
sp|Q57101|BACR_HALAR 312.2 Cruxrhodopsin-1 Haloarcula argentinensis
sp|Q5UXY6|BACR1_HALMA 310.4 Bacteriorhodopsin-I Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|P94854|BACR_HALVA 309.1 Cruxrhodopsin-3 Haloarcula vallismortis
sp|P29563|BACR2_HALS2 303.1 Archaerhodopsin-2 Halobacterium sp. (strain aus-2)
sp|G0LFX8|BACR1_HALWC 301.9 Bacteriorhodopsin-I Haloquadratum walsbyi (strain DSM 16854 / JCM 12705 / C23)
sp|Q18DH8|BACR1_HALWD 301.9 Bacteriorhodopsin-I Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
sp|P69051|BACR1_HALEZ 301.3 Archaerhodopsin-1 Halorubrum ezzemoulense
sp|P69052|BACR1_HALSS 301.3 Archaerhodopsin-1 Halobacterium sp. (strain SG1)
sp|Q53496|BACR_HALSR 299.7 Cruxrhodopsin-2 Haloarcula sp. (strain arg-2 / Andes heights)
sp|P96787|BACR3_HALSD 299.0 Archaerhodopsin-3 Halorubrum sodomense
sp|Q5V0R5|BACR2_HALMA 293.1 Bacteriorhodopsin-II Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|P02945|BACR_HALSA 288.0 Bacteriorhodopsin Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|P33970|BACH_HALHM 283.2 Halorhodopsin (Fragment) Halobacterium halobium (strain mex)
sp|O93740|BACR_HALS4 282.3 Bacteriorhodopsin Haloterrigena sp. (strain arg-4)
sp|P33971|BACR_HALHP 278.8 Bacteriorhodopsin (Fragment) Halobacterium halobium (strain port)
sp|Q18DH5|BACRM_HALWD 277.0 Bacteriorhodopsin-II-like protein Haloquadratum walsbyi (strain DSM 16790 / HBSQ001)
sp|P33972|BACR_HALHS 275.9 Bacteriorhodopsin (Fragment) Halobacterium halobium (strain shark)
sp|P33969|BACR_HALHM 262.3 Bacteriorhodopsin (Fragment) Halobacterium halobium (strain mex)
sp|O93743|BACS_HALSD 234.6 Sensory rhodopsin Halorubrum sodomense
sp|P33743|BACS1_HALSS 234.4 Sensory rhodopsin-1 Halobacterium sp. (strain SG1)
sp|P42196|BACS2_NATPH 230.7 Sensory rhodopsin-2 Natronomonas pharaonis
sp|Q48334|BAC3_HALVA 229.2 Bacterial rhodopsin CSR3 Haloarcula vallismortis
sp|Q5UXM9|BACS1_HALMA 222.8 Sensory rhodopsin I Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|B0R633|BACS1_HALS3 222.5 Sensory rhodopsin-1 Halobacterium salinarum (strain ATCC 29341 / DSM 671 / R1)
sp|P0DMH8|BACS1_HALSA 222.5 Sensory rhodopsin-1 Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|P42197|BACS2_HALVA 221.1 Sensory rhodopsin-2 Haloarcula vallismortis
sp|Q5V5V3|BACS2_HALMA 221.1 Sensory rhodopsin II Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|P71411|BACS2_HALSA 218.6 Sensory rhodopsin-2 Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
sp|Q5V4H7|BACS3_HALMA 205.9 Sensory rhodopsin III Haloarcula marismortui (strain ATCC 43049 / DSM 3752 / JCM 8966 / VKM B-1809)
sp|Q53461|BACH_HALAR 202.7 Cruxhalorhodopsin-1 (Fragment) Haloarcula argentinensis
Recording bitscore cutoffs¶
A useful feature of HMMER is the possibility to use HMM-specific bitscore cutoffs when searching sequences with a HMM, instead of relying on E-value cutoffs. Three cutoffs categories are available:
Noise cutoffs (NC): The least stringent of the 3 cutoffs, normally set as the highest score seen for negative sequences when gathering members of full alignments.
Gathering cutoffs (GA): More stringent than noise cutoffs, normally computed as the cutoffs used when gathering members of full alignments.
Trusted cutoffs (TC): The most stringent of the 3 cutoffs, normally set according to the lowest scores seen for true homologous sequences that were above the GA gathering thresholds, when gathering members of full alignments.
[20]:
hmm = iterations[-1].hmm
Let’s first build noise cutoffs based on the negative hits: we can use the highest scoring excluded sequence.
[21]:
from math import ceil
noise_score = max(hit.score for hit in iterations[-1].hits if not hit.included)
noise_domscore = max(hit.best_domain.score for hit in iterations[-1].hits if not hit.included)
hmm.cutoffs.noise = ceil(noise_score), ceil(noise_domscore)
For gathering, we can use the lowest scoring included sequence:
[22]:
from math import floor
gathering_score = min(hit.score for hit in iterations[-1].hits if hit.included)
gathering_domscore = min(hit.best_domain.score for hit in iterations[-1].hits if hit.included)
hmm.cutoffs.gathering = floor(gathering_score), floor(gathering_domscore)
For trusted, it’s a little bit less obvious. If you look at the bitscore distribution from the last iteration, we can see 3 distinct group of hits, one in the 390-350 range, one in the 315-260 range, and one in the 235-200 range. We may want to set the trusted cutoffs so that the only the hits in the top range can be trusted.
[23]:
from math import floor
i = max(i for i, hit in enumerate(iterations[-1].hits) if hit.score > 350)
hmm.cutoffs.trusted = floor(iterations[-1].hits[i].score), floor(iterations[-1].hits[i].best_domain.score)
Otherwise, for a more unsupervised approach, we could try to extract a cutoffs by selecting the score of the hits where the derivative is at its maximum:
[24]:
diff = [
iterations[-1].hits[i-1].score - iterations[-1].hits[i].score
for i in range(1, len(iterations[-1].hits))
if iterations[-1].hits[i-1].included
]
[25]:
import matplotlib.pyplot as plt
plt.plot(diff)
plt.ylabel("dscore")
plt.xlabel("Hit rank")
plt.show()
[26]:
from math import floor
argmax = max(range(len(diff)), key=diff.__getitem__)
hit = iterations[-1].hits[argmax]
hmm.cutoffs.trusted = floor(hit.score), floor(hit.best_domain.score)
Saving the HMM¶
Now that we are done creating the HMM, we can save it for later use:
[27]:
hmm.name = b"Rhodopsin"
hmm.description = None
hmm.accession = None
[28]:
with open("rhodopsin.hmm", "wb") as dst_file:
hmm.write(dst_file)