Build an HMM from a multiple sequence alignment

[1]:
import pyhmmer
pyhmmer.__version__
[1]:
'0.10.12'
[2]:
alphabet = pyhmmer.easel.Alphabet.amino()

Loading the alignment

A new HMM can be built from a single sequence, or from a multiple sequence alignment. Let’s load an alignment in digital mode so that we can build our HMM. We can use a MSAFile to read an alignment from a file in one of the supported formats (such as Stockholm, Clustal, aligned FASTA, etc; see the MSAFile documentation for the complete list):

[3]:
with pyhmmer.easel.MSAFile("data/msa/LuxC.sto", digital=True, alphabet=alphabet) as msa_file:
    msa = msa_file.read()

Note

In this example, we load a multiple sequence alignment from a file, but if your program produces alignment and you wish to make an HMM out of them, you can instantiate a TextMSA object yourself, e.g.:

seq1 = pyhmmer.easel.TextSequence(name=b"seq1", sequence="WVPKQDFT")
seq2 = pyhmmer.easel.TextSequence(name=b"seq2", sequence="WL--PQGE")
msa  = pyhmmer.easel.TextMSA(name=b"msa", sequences=[seq1, seq2])

Because we need a DigitalMSA to build the HMM, you will have to convert it first:

msa_d = msa.digitize(alphabet)

Building an HMM

Now that we have a multiple alignment loaded in memory, we can build a pHMM using a pyhmmer.plan7.Builder. This also requires a Plan7 background model to compute the transition probabilities.

[4]:
builder = pyhmmer.plan7.Builder(alphabet)
background = pyhmmer.plan7.Background(alphabet)
hmm, _, _ = builder.build_msa(msa, background)

Warning

HMMs must always have a name. The Builder will attempt to use the MSA name, but in the event your alignment file does not contain a name for the whole alignment (for instance, a #=GF ID line in a Stockholm file), this method will fail:

>>> anonymous_msa = pyhmmer.easel.TextMSA(sequences=[seq1, seq2])
>>> builder.build_msa(anonymous_msa)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: Could not build HMM: Unable to name the HMM.

If your alignment is missing a name, you can assign one manually by setting the MSA.name property, e.g.:

msa.name = b"protein_family_01"

We can have a look at the consensus sequence of the HMM with the consensus property:

[5]:
hmm.consensus
[5]:
'lanlkleeildlleevaqrlkdeeysrryirelakilgyeeemlkalkalmallskeaLkdllereLgqpeildeflprkesyekaqpkglvvhllagNvpllpvmsileglLvknvnllkvSssdpflaaallksladvdadhtlarsisvvywkssdtkleeeivqnaDvviawGGeeAikaivkklkpgvdlikfGpkiSlavvdkeaalekaaeavAkDicvydQqAClSpqvvfvesddeaklrefaeeLaeaLekrakilPkaelsideaaaisskraeaklkyllseekkvvsekdqkwtvilseeqellnsPlsrtvnvkavpdiedvveyvtknrtlQtvglavkeselkelaeklaaaGverivevGemnlfrsGephDgvyaLqrlvrl'

Saving the resulting HMM

Now that we have an HMM, we can save it to a file to avoid having to rebuild it every time. Using the HMM.write method lets us write the HMM in ASCII or binary format to an arbitrary file. The resulting file will also be compatible with the hmmsearch binary if you wish to use that one instead of PyHMMER.

[6]:
with open("LuxC.hmm", "wb") as output_file:
    hmm.write(output_file)

Applying the HMM to a sequence database

Once a pHMM has been obtained, it can be applied to a sequence database with the pyhmmer.plan7.Pipeline object. Let’s iterate over the protein sequences in a FASTA to see if our new HMM gets any hits:

[7]:
pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background)
with pyhmmer.easel.SequenceFile("data/seqs/LuxC.faa", digital=True, alphabet=alphabet) as seq_file:
    hits = pipeline.search_hmm(hmm, seq_file)

We can then query the TopHits object to access the domain hits in the sequences:

[8]:
ali = hits[0].domains[0].alignment
print(ali)
                    LuxC   2 anlkleeildlleevaqrlkdeeysrr..yirelakilgyeeemlkalka...lmallskeaLkdllereLgqpeildeflprkesyekaqpkglvvhllagNvpllpvmsileglLvknvnllkvSssdpflaaallksladvdadhtlarsisvvyw.kssdtkleeeivqnaDvviawGGeeAikaivkklkpgvdlikfGpkiSlavvdkeaalekaaeavAkDicvydQqAClSpqvvfvesddeaklrefaeeLaeaLekrakilPkaelsideaaaisskraeaklkyllseekkvvsekdqkwtvilseeqellnsPlsrtvnvkavpdiedvveyvtknrtlQtvglavkeselkelaeklaaaGverivevGemnlfrsGephDgvyaLqrlvrl 400
                             +nl+l++++++l++v+qr+++eey+rr  yir+l+++lgy++em+k l+a   +m l+sk+aL+d+++++Lg+ +i+de++p++++y+ka+pkg++vhllagNvpl++v+sil+++L+kn++++k+Ss+dpf+a+al++s++dv+adh+++rs+sv+yw +++d++l+++i+++aDvv awGG+eAik++vk+++++vd++kfGpk+Sl+++d+++++++aa++vA+Dic+ydQqAC+S+q++++    ++kl+ f +eL+++L+ +akilPk ++++de+aa++ +++e     +l+++++v ++++q+w++i+s+ +++ n+Plsr+v++++v++ied++++++kn+t Qtv++a++es++k+++e la++G+erive+G++n+fr+G++hDg+++Lq+lv++
  tr|B6ESM7|B6ESM7_ALISL  50 NNLRLNQVVNFLYTVGQRWRSEEYTRRrtYIRDLTNFLGYSNEMAK-LEAnwiAMLLCSKSALYDIVQHDLGSLHIIDEWIPQGDCYIKALPKGKSVHLLAGNVPLSGVTSILRAILTKNECIIKTSSTDPFTATALVSSFIDVNADHPITRSMSVMYWsHNEDMSLPKKIMNHADVV-AWGGDEAIKWAVKHSPHHVDIVKFGPKKSLSIIDDPEDITAAATGVAHDICFYDQQACFSTQNIYYI---GNKLNLFIDELEKKLNVYAKILPKGCQNFDEKAAFTLTEKE-----CLFAGYQVRKGENQSWIIIQSPLDAFGNQPLSRSVYIHQVSSIEDILPFINKNTT-QTVSIAPWESSFKYRDE-LAKHGAERIVESGMNNIFRVGGAHDGMRPLQYLVNY 442
                             899*******************************************.8*99*****************************************************************************************************************************98.*******************************************************************...9****************************************.....*****************************************************87.*****************.**********************************86 PP

Finally, we can save the results in tabular format similar to what hmmsearch would have produced:

[9]:
with open("LuxC.domtbl", "wb") as f:
    hits.write(f, format="domains")