Active Site Analysis

This example is adapted from the method used by AntiSMASH to annotate biosynthetic gene clusters. AntiSMASH uses profile HMMs to annotate enzymatic domains in protein sequences. By matching the amino acids in the alignment, it can then predict the product specificity of the enzyme.

In this notebook, we show how to reproduce this kind of analysis, using a PKSI Acyltransferase domain built by the AntiSMASH authors (the HMM in HMMER2 format can be downloaded from their git repository).

import pyhmmer

Loading the HMM

Loading a HMMER profile is done with the pyhmmer.plan7.HMMFile class, which provides an iterator over the HMMs in the file. Since we only use a single HMM, we can simply use next to get the first (and only) pyhmmer.plan7.HMM.

with pyhmmer.plan7.HMMFile("data/hmms/txt/PKSI-AT.hmm") as hmm_file:
    hmm = next(hmm_file)

Building digitized sequences

Easel provides the code necessary to load sequences from files in common biological formats, such as GenBank or FASTA. These utilities are wrapped by the pyhmmer.easel.SequenceFile, which provides an iterator over the sequences in the file. Note that SequenceFile tries to guess the format by default, but you can force a particular format with the format keyword argument.

with pyhmmer.easel.SequenceFile("data/seqs/PKSI.faa") as seq_file:
    sequences = [ seq.digitize(hmm.alphabet) for seq in seq_file ]


The C interface of Easel allows storing a sequence in two different modes: in text mode, where the sequence letters are represented as individual characters (e.g. “A” or “Y”), and digital mode, where sequence letters are encoded as digits. To make Python programs clearer, and to allow static typecheck of the storage mode, we provide two separate classes, TextSequence and DigitalSequence, that represent a sequence stored in either of these modes.

SequenceFile yields sequences in text mode, but HMMER expects sequences in digital mode, so we must digitize them. This requires the sequence alphabet to be known, but we can just use the Alphabet instance stored in the alphabet attribute of hmm.

Running a search pipeline

With the sequences and the HMM ready, we can finally run the search pipeline: it has to be initialized with an Alphabet instance, so that the Plan7 background model can be configured accordingly. Then, we run the pipeline in search mode, providing it one HMM, and several sequences. This method returns a TopHits instance that is already sorted and thresholded.

pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)
hits = pipeline.search_hmm(hmm, sequences)

Rendering the alignments

Domain instances store all the required information to report results in their alignment attribute. We can show the alignment between a HMM and a sequence like hmmsearch would as follow (using the first domain of the first hit as an example):

ali = hits[0].domains[0].alignment

print(" "*3, ali.target_name.decode())
print("{:3}".format(ali.hmm_from), ali.hmm_sequence, "{:3}".format(ali.hmm_to))
print(" "*3, ali.identity_sequence)
print("{:3}".format(ali.target_from), ali.target_sequence, "{:3}".format(ali.target_to))
print(" "*3, ali.hmm_name.decode())
  1 lFpGQGsQyaGMGreLYetePVFRqalDrCaaaLrphLgfsLlevLfgdegqeeaaaslLdqTryaQPALFAvEYALArLWrSWGvePdAVlGHSvGEyvAAcvAGVlSLEDALrLVaaRGrLMqa.lpggGaMlaVraseeevrelLapyggrlsiAAvNGPrsvVvSGdaeaieallaeLeaqGirarrLkVsHAFHSplMepmldeleevlagitpraPriPliSnvTGewltgeealdpaYWarhlRePVrFadgletLlaelGctvFlEvGPhpvLtalarrtlgesagtngadaawlaSLrrg 308
    +FpGQG+Q+aGMG eL++++ VF++a+ +C+aaL+p++++sL +v ++ +g     a+ L++++++QP+ FAv+++LAr W+  Gv+P+AV+GHS+GE++AA+vAG+lSL+DA+r+V  R++ ++a l+g+G+Ml+ ++se+ v e+La+++ +ls+AAvNGP ++VvSGd+ +ie+l++++ea G+rar ++V++A+HS+++e +  el+evlag++p+aPr+P++S++ G+w+t+  +ld++YW+r+lR+ V Fa+++etL+ + G+t+F+Ev++hpvLt ++  t            + la+Lrr+

You may also want to see where the domains are located in the input sequence; using the DNA feature viewer developed by the Edinburgh Genome Foundry, we can build a summary graph aligning the protein sequences to the same reference axis:

from dna_features_viewer import GraphicFeature, GraphicRecord
import matplotlib.pyplot as plt

# create an index so we can retrieve a Sequence from its name
seq_index = { for seq in sequences }

fig, axes = plt.subplots(nrows=len(hits), figsize=(16, 6), sharex=True)
for ax, hit in zip(axes, hits):
    # add one feature per domain
    features = [
        GraphicFeature(start=d.alignment.target_from-1, end=d.alignment.target_to)
        for d in
    length = len(seq_index[])
    desc = seq_index[].description.decode()

    # render the feature records
    record = GraphicRecord(sequence_length=length, features=features)

# make sure everything fits in the final graph!
/home/docs/checkouts/ FutureWarning: --rc={'figure.dpi': 96} for dict-traits is deprecated in traitlets 5.0. You can pass --rc <key=value> ... multiple times to add items to a dict.

Checking individual positions for catalytic activity

First let’s define a function to iterate over an alignement; this will come in handy later. This function yields the position in the alignment (using the HMM coordinates) and the aligned amino acid, skipping over gaps in the HMM sequence.

def iter_target_match(alignment):
    position = alignment.hmm_from
    for hmm_letter, amino_acid in zip(alignment.hmm_sequence, alignment.target_sequence):
        if hmm_letter != ".":
            yield position, amino_acid
            position += 1

Now, for the final step, we want to check for the specificity of the enzyme domains; Del Vecchio et al. have identified two amino acids in the acyltransferase domain that once muted will decide of the enzyme specificity for either malonyl-CoA or methylmalonyl-CoA:


For this, we need to check the alignment produced by HMMER, and verify the residues of the catalytic site correspond to the ones expected by the authors. We use the function we defined previously, first to check the core amino acids are not muted, and then to check the specificity of the two remaining residues.

POSITIONS   = [ 93,  94,  95, 120, 196, 198]
EXPECTED    = ['G', 'H', 'S', 'R', 'A', 'H']
SPECIFICITY = [195, 197]

for hit in hits:
    print("\nIn sequence {!r}:".format(
    for domain in
        ali = domain.alignment
        aligned = dict(iter_target_match(ali))

        print("- Found PKSI-AT domain at positions {:4} to {:4}".format(ali.target_from, ali.target_to))
            signature = [ aligned[x] for x in POSITIONS ]
            spec = [ aligned[x] for x in SPECIFICITY ]
        except KeyError:
            print("  -> Domain likely too short")
        if signature != EXPECTED:
            print("  -> Substrate specificity unknown")
        elif spec == ["H", "F"]:
            print("  -> Malonyl-CoA specific")
        elif spec == ["Y", "S"]:
            print("  -> Methylmalonyl-CoA specific")
            print("  -> Neither malonyl-CoA nor methylmalonyl-CoA specific")

In sequence 'sp|Q9ZGI5|PIKA1_STRVZ':
- Found PKSI-AT domain at positions  635 to  925
  -> Methylmalonyl-CoA specific
- Found PKSI-AT domain at positions 1651 to 1927
  -> Methylmalonyl-CoA specific
- Found PKSI-AT domain at positions 3181 to 3475
  -> Malonyl-CoA specific

In sequence 'sp|Q9ZGI2|PIKA4_STRVZ':
- Found PKSI-AT domain at positions  563 to  837
  -> Methylmalonyl-CoA specific

In sequence 'sp|A0A089QRB9|MSL3_MYCTU':
- Found PKSI-AT domain at positions  540 to  834
  -> Neither malonyl-CoA nor methylmalonyl-CoA specific

In sequence 'sp|Q9Y8A5|LOVB_ASPTE':
- Found PKSI-AT domain at positions  562 to  585
  -> Domain likely too short
- Found PKSI-AT domain at positions  651 to  854
  -> Neither malonyl-CoA nor methylmalonyl-CoA specific

In sequence 'sp|Q54FI3|STLB_DICDI':
- Found PKSI-AT domain at positions  625 to  726
  -> Domain likely too short
- Found PKSI-AT domain at positions  766 to  838
  -> Domain likely too short
- Found PKSI-AT domain at positions  880 to  944
  -> Domain likely too short