Build an HMM from a multiple sequence alignment#
[1]:
import pyhmmer
pyhmmer.__version__
[1]:
'0.10.14'
[2]:
alphabet = pyhmmer.easel.Alphabet.amino()
Loading the alignment#
A new HMM can be built from a single sequence, or from a multiple sequence alignment. Let’s load an alignment in digital mode so that we can build our HMM. We can use a MSAFile
to read an alignment from a file in one of the supported formats (such as Stockholm, Clustal, aligned
FASTA, etc; see the MSAFile
documentation for the complete list):
[3]:
with pyhmmer.easel.MSAFile("data/msa/LuxC.sto", digital=True, alphabet=alphabet) as msa_file:
msa = msa_file.read()
Note
In this example, we load a multiple sequence alignment from a file, but if your program produces alignment and you wish to make an HMM out of them, you can instantiate a TextMSA
object yourself, e.g.:
seq1 = pyhmmer.easel.TextSequence(name=b"seq1", sequence="WVPKQDFT")
seq2 = pyhmmer.easel.TextSequence(name=b"seq2", sequence="WL--PQGE")
msa = pyhmmer.easel.TextMSA(name=b"msa", sequences=[seq1, seq2])
Because we need a DigitalMSA
to build the HMM, you will have to convert it first:
msa_d = msa.digitize(alphabet)
Building an HMM#
Now that we have a multiple alignment loaded in memory, we can build a pHMM using a pyhmmer.plan7.Builder
. This also requires a Plan7 background model to compute the transition probabilities.
[4]:
builder = pyhmmer.plan7.Builder(alphabet)
background = pyhmmer.plan7.Background(alphabet)
hmm, _, _ = builder.build_msa(msa, background)
Warning
HMMs must always have a name. The Builder
will attempt to use the MSA
name, but in the event your alignment file does not contain a name for the whole alignment (for instance, a #=GF ID
line in a Stockholm file), this method will fail:
>>> anonymous_msa = pyhmmer.easel.TextMSA(sequences=[seq1, seq2])
>>> builder.build_msa(anonymous_msa)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: Could not build HMM: Unable to name the HMM.
If your alignment is missing a name, you can assign one manually by setting the MSA.name
property, e.g.:
msa.name = b"protein_family_01"
We can have a look at the consensus sequence of the HMM with the consensus
property:
[5]:
hmm.consensus
[5]:
'lanlkleeildlleevaqrlkdeeysrryirelakilgyeeemlkalkalmallskeaLkdllereLgqpeildeflprkesyekaqpkglvvhllagNvpllpvmsileglLvknvnllkvSssdpflaaallksladvdadhtlarsisvvywkssdtkleeeivqnaDvviawGGeeAikaivkklkpgvdlikfGpkiSlavvdkeaalekaaeavAkDicvydQqAClSpqvvfvesddeaklrefaeeLaeaLekrakilPkaelsideaaaisskraeaklkyllseekkvvsekdqkwtvilseeqellnsPlsrtvnvkavpdiedvveyvtknrtlQtvglavkeselkelaeklaaaGverivevGemnlfrsGephDgvyaLqrlvrl'
Saving the resulting HMM#
Now that we have an HMM, we can save it to a file to avoid having to rebuild it every time. Using the HMM.write
method lets us write the HMM in ASCII or binary format to an arbitrary file. The resulting file will also be compatible with the hmmsearch
binary if you wish to use that one instead of PyHMMER.
[6]:
with open("LuxC.hmm", "wb") as output_file:
hmm.write(output_file)
Applying the HMM to a sequence database#
Once a pHMM has been obtained, it can be applied to a sequence database with the pyhmmer.plan7.Pipeline
object. Let’s iterate over the protein sequences in a FASTA to see if our new HMM gets any hits:
[7]:
pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background)
with pyhmmer.easel.SequenceFile("data/seqs/LuxC.faa", digital=True, alphabet=alphabet) as seq_file:
hits = pipeline.search_hmm(hmm, seq_file)
We can then query the TopHits
object to access the domain hits in the sequences:
[8]:
ali = hits[0].domains[0].alignment
print(ali)
LuxC 2 anlkleeildlleevaqrlkdeeysrr..yirelakilgyeeemlkalka...lmallskeaLkdllereLgqpeildeflprkesyekaqpkglvvhllagNvpllpvmsileglLvknvnllkvSssdpflaaallksladvdadhtlarsisvvyw.kssdtkleeeivqnaDvviawGGeeAikaivkklkpgvdlikfGpkiSlavvdkeaalekaaeavAkDicvydQqAClSpqvvfvesddeaklrefaeeLaeaLekrakilPkaelsideaaaisskraeaklkyllseekkvvsekdqkwtvilseeqellnsPlsrtvnvkavpdiedvveyvtknrtlQtvglavkeselkelaeklaaaGverivevGemnlfrsGephDgvyaLqrlvrl 400
+nl+l++++++l++v+qr+++eey+rr yir+l+++lgy++em+k l+a +m l+sk+aL+d+++++Lg+ +i+de++p++++y+ka+pkg++vhllagNvpl++v+sil+++L+kn++++k+Ss+dpf+a+al++s++dv+adh+++rs+sv+yw +++d++l+++i+++aDvv awGG+eAik++vk+++++vd++kfGpk+Sl+++d+++++++aa++vA+Dic+ydQqAC+S+q++++ ++kl+ f +eL+++L+ +akilPk ++++de+aa++ +++e +l+++++v ++++q+w++i+s+ +++ n+Plsr+v++++v++ied++++++kn+t Qtv++a++es++k+++e la++G+erive+G++n+fr+G++hDg+++Lq+lv++
tr|B6ESM7|B6ESM7_ALISL 50 NNLRLNQVVNFLYTVGQRWRSEEYTRRrtYIRDLTNFLGYSNEMAK-LEAnwiAMLLCSKSALYDIVQHDLGSLHIIDEWIPQGDCYIKALPKGKSVHLLAGNVPLSGVTSILRAILTKNECIIKTSSTDPFTATALVSSFIDVNADHPITRSMSVMYWsHNEDMSLPKKIMNHADVV-AWGGDEAIKWAVKHSPHHVDIVKFGPKKSLSIIDDPEDITAAATGVAHDICFYDQQACFSTQNIYYI---GNKLNLFIDELEKKLNVYAKILPKGCQNFDEKAAFTLTEKE-----CLFAGYQVRKGENQSWIIIQSPLDAFGNQPLSRSVYIHQVSSIEDILPFINKNTT-QTVSIAPWESSFKYRDE-LAKHGAERIVESGMNNIFRVGGAHDGMRPLQYLVNY 442
899*******************************************.8*99*****************************************************************************************************************************98.*******************************************************************...9****************************************.....*****************************************************87.*****************.**********************************86 PP
Finally, we can save the results in tabular format similar to what hmmsearch
would have produced:
[9]:
with open("LuxC.domtbl", "wb") as f:
hits.write(f, format="domains")