Build an HMM from a multiple sequence alignment¶
[1]:
import pyhmmer
pyhmmer.__version__
[1]:
'0.4.11'
[2]:
alphabet = pyhmmer.easel.Alphabet.amino()
Loading the alignment¶
A new HMM can be built from a single sequence, or from a multiple sequence alignment. Let’s load an alignment in digital mode so that we can build our HMM:
[3]:
with pyhmmer.easel.MSAFile("data/msa/LuxC.sto") as msa_file:
msa_file.set_digital(alphabet)
msa = next(msa_file)
Note
In this example, we load a multiple sequence alignment from a file, but if your program produces alignment and you wish to make an HMM out of them, you can instantiate a TextMSA
object yourself, e.g.:
seq1 = pyhmmer.easel.TextSequence(name="seq1", sequence="WVPKQDFT")
seq2 = pyhmmer.easel.TextSequence(name="seq2", sequence="WL--PQGE")
msa = pyhmmer.easel.TextMSA(name="msa", sequences=[seq1, seq2])
Because we need a DigitalMSA
to build the HMM, you will have to convert it first:
msa_d = msa.digitize(alphabet)
Building an HMM¶
Now that we have a multiple alignment loaded in memory, we can build a pHMM using a pyhmmer.plan7.Builder
. This also requires a Plan7 background model to compute the transition probabilities.
[4]:
builder = pyhmmer.plan7.Builder(alphabet)
background = pyhmmer.plan7.Background(alphabet)
hmm, _, _ = builder.build_msa(msa, background)
We can have a look at the consensus sequence of the HMM with the consensus
property:
[5]:
hmm.consensus
[5]:
'lanlkleeildlleevaqrlkdeeysrryirelakilgyeeemlkalkalmallskeaLkdllereLgqpeildeflprkesyekaqpkglvvhllagNvpllpvmsileglLvknvnllkvSssdpflaaallksladvdadhtlarsisvvywkssdtkleeeivqnaDvviawGGeeAikaivkklkpgvdlikfGpkiSlavvdkeaalekaaeavAkDicvydQqAClSpqvvfvesddeaklrefaeeLaeaLekrakilPkaelsideaaaisskraeaklkyllseekkvvsekdqkwtvilseeqellnsPlsrtvnvkavpdiedvveyvtknrtlQtvglavkeselkelaeklaaaGverivevGemnlfrsGephDgvyaLqrlvrl'
Saving the resulting HMM¶
Now that we have an HMM, we can save it to a file to avoid having to rebuild it every time. Using the HMM.write
method lets us write the HMM in ASCII or binary format to an arbitrary file. The resulting file will also be compatible with the hmmsearch
binary if you wish to use that one instead of pyHMMER.
[6]:
with open("LuxC.hmm", "wb") as output_file:
hmm.write(output_file)
Applying the HMM to a sequence database¶
Once a pHMM has been obtained, it can be applied to a sequence database with the pyhmmer.plan7.Pipeline
object. Let’s iterate over the protein sequences in a FASTA to see if our new HMM gets any hits:
[7]:
with pyhmmer.easel.SequenceFile("data/seqs/LuxC.faa") as seq_file:
seq_file.set_digital(alphabet)
sequences = list(seq_file)
pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background)
hits = pipeline.search_hmm(query=hmm, sequences=sequences)
We can then query the TopHits
object to access the domain hits in the sequences:
[8]:
ali = hits[0].domains[0].alignment
print(" "*3, ali.target_name.decode())
print("{:3}".format(ali.hmm_from), ali.hmm_sequence[:80] + "...")
print(" "*3, ali.identity_sequence[:80] + "...")
print("{:3}".format(ali.target_from), ali.target_sequence[:80] + "...")
print(" "*3, ali.hmm_name.decode())
tr|B6ESM7|B6ESM7_ALISL
2 anlkleeildlleevaqrlkdeeysrr..yirelakilgyeeemlkalka...lmallskeaLkdllereLgqpeildef...
+nl+l++++++l++v+qr+++eey+rr yir+l+++lgy++em+k l+a +m l+sk+aL+d+++++Lg+ +i+de+...
50 NNLRLNQVVNFLYTVGQRWRSEEYTRRrtYIRDLTNFLGYSNEMAK-LEAnwiAMLLCSKSALYDIVQHDLGSLHIIDEW...
LuxC