Multiple Sequence Alignment to HMM

[1]:
import pyhmmer
pyhmmer.__version__
[1]:
'0.3.0'
[2]:
alphabet = pyhmmer.easel.Alphabet.amino()

Loading the alignment

A new HMM can be built from a single sequence, or from a multiple sequence alignment. Let’s load an alignment in digital mode so that we can build our HMM:

[3]:
with pyhmmer.easel.MSAFile("data/msa/LuxC.sto") as msa_file:
    msa_file.set_digital(alphabet)
    msa = next(msa_file)

Note

In this example, we load a multiple sequence alignment from a file, but if your program produces alignment and you wish to produce an HMM out of them, you can instantiate a DigitalMSA object yourself, e.g.:

seq1 = pyhmmer.easel.TextSequence(name="seq1", sequence="WVPKQDFT")
seq2 = pyhmmer.easel.TextSequence(name="seq2", sequence="WL--PQGE")
msa  = pyhmmer.easel.DigitalMSA(name="msa", sequences=[seq1, seq2])

Because we need a DigitalMSA to build the HMM, you will have to convert it first:

msa_d = msa.digitize(alphabet)

Building an HMM

Now that we have a multiple alignment loaded in memory, we can build a pHMM using a pyhmmer.plan7.Builder. This also requires a Plan7 background model to compute the transition probabilities.

[4]:
builder = pyhmmer.plan7.Builder(alphabet)
background = pyhmmer.plan7.Background(alphabet)
hmm, _, _ = builder.build_msa(msa, background)

We can have a look at the consensus sequence of the HMM with the consensus property:

[5]:
hmm.consensus
[5]:
'lanlkleeildlleevaqrlkdeeysrryirelakilgyeeemlkalkalmallskeaLkdllereLgqpeildeflprkesyekaqpkglvvhllagNvpllpvmsileglLvknvnllkvSssdpflaaallksladvdadhtlarsisvvywkssdtkleeeivqnaDvviawGGeeAikaivkklkpgvdlikfGpkiSlavvdkeaalekaaeavAkDicvydQqAClSpqvvfvesddeaklrefaeeLaeaLekrakilPkaelsideaaaisskraeaklkyllseekkvvsekdqkwtvilseeqellnsPlsrtvnvkavpdiedvveyvtknrtlQtvglavkeselkelaeklaaaGverivevGemnlfrsGephDgvyaLqrlvrl'

Saving the resulting HMM

Now that we have an HMM, we can save it to a file to avoid having to rebuild it every time. Using the HMM.write method lets us write the HMM in ASCII format to an arbitrary file. The resulting file will also be compatible with the hmmsearch binary if you wish to use that one instead of PyHMMER.

[6]:
with open("data/hmms/txt/LuxC.hmm", "wb") as output_file:
    hmm.write(output_file)

Applying the HMM to a sequence database

Once a pHMM has been obtained, it can be applied to a sequence database with the pyhmmer.plan7.Pipeline object. Let’s iterate over the protein sequences in a FASTA to see if our new HMM gets any hits:

[7]:
pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background, report_e=1e-5)

with pyhmmer.easel.SequenceFile("data/seqs/LuxC.faa") as seq_file:
    seq_file.set_digital(alphabet)
    hits = pipeline.search_hmm(query=hmm, sequences=seq_file)

We can then query the TopHits object to access the domain hits in the sequences:

[8]:
ali = hits[0].domains[0].alignment

print(" "*3, ali.target_name.decode())
print("{:3}".format(ali.hmm_from), ali.hmm_sequence[:80] + "...")
print(" "*3, ali.identity_sequence[:80] + "...")
print("{:3}".format(ali.target_from), ali.target_sequence[:80] + "...")
print(" "*3, ali.hmm_name.decode())
    tr|B6ESM7|B6ESM7_ALISL
  2 anlkleeildlleevaqrlkdeeysrr..yirelakilgyeeemlkalka...lmallskeaLkdllereLgqpeildef...
    +nl+l++++++l++v+qr+++eey+rr  yir+l+++lgy++em+k l+a   +m l+sk+aL+d+++++Lg+ +i+de+...
 50 NNLRLNQVVNFLYTVGQRWRSEEYTRRrtYIRDLTNFLGYSNEMAK-LEAnwiAMLLCSKSALYDIVQHDLGSLHIIDEW...
    LuxC