Analyse the active site of an enzymatic domain

This example is adapted from the method used by antiSMASH to annotate biosynthetic gene clusters. antiSMASH uses profile HMMs to annotate enzymatic domains in protein sequences. By inspecting the amino acids aligned to key positions of the profile, it can then predict the product specificity of the enzyme.

In this notebook, we show how to reproduce this kind of analysis, using a profile HMM of the PKSI acyltransferase domain built by the antiSMASH authors (the HMM in HMMER2 format can be downloaded from their git repository).

[1]:
import pyhmmer
pyhmmer.__version__
[1]:
'0.10.15'

Loading the HMM

Loading a HMMER profile is done with the pyhmmer.plan7.HMMFile class, which provides an iterator over the HMMs in the file. We can use the read method to get the first (and only) pyhmmer.plan7.HMM from the file.

[2]:
with pyhmmer.plan7.HMMFile("data/hmms/txt/PKSI-AT.hmm") as hmm_file:
    hmm = hmm_file.read()

Loading digitized sequences

Easel provides the code necessary to load sequences from files in common biological formats, such as GenBank or FASTA. These utilities are wrapped by the pyhmmer.easel.SequenceFile class, which provides an iterator over the sequences in the file. By default, SequenceFile tries to guess the file format, but you can force a particular format with the format keyword argument. Here we pass digital=True so that the sequences are loaded in digital format. The alphabet is guessed from the sequence content unless given explicitly with the alphabet keyword argument.

[3]:
with pyhmmer.easel.SequenceFile("data/seqs/PKSI.faa", digital=True) as seq_file:
    sequences = seq_file.read_block()

Note

The C interface of Easel allows storing a sequence in two different modes: text mode, where the sequence letters are represented as individual characters (e.g. “A” or “Y”), and digital mode, where sequence letters are encoded as digits. To make Python programs clearer, and to allow static type checking of the storage mode, we provide two separate classes, TextSequence and DigitalSequence, that represent a sequence stored in either of these modes. Most functions that perform actual work (such as pyhmmer.hmmsearch) expect digital sequences.
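
To make the difference between the two modes concrete, here is a minimal pure-Python sketch of what digitizing a sequence means. It does not call Easel: the AMINO_SYMBOLS string and the digitize and textize helpers are illustrative assumptions that mirror the symbol order of the Easel amino acid alphabet, not the library's implementation.

```python
# Hypothetical illustration only: the symbol order is assumed to match
# the Easel amino acid alphabet (20 residues, gap, then degenerate symbols).
AMINO_SYMBOLS = "ACDEFGHIKLMNPQRSTVWY-BJZOUX*~"

def digitize(text):
    """Encode each residue letter as its index in AMINO_SYMBOLS."""
    return bytes(AMINO_SYMBOLS.index(letter) for letter in text.upper())

def textize(digital):
    """Decode a digital sequence back into residue letters."""
    return "".join(AMINO_SYMBOLS[code] for code in digital)

encoded = digitize("GHSAY")
print(list(encoded))     # small integers instead of characters
print(textize(encoded))  # round-trips back to the text form
```

In PyHMMER itself, the conversion is handled by the sequence classes (TextSequence.digitize and DigitalSequence.textize), so a helper like this is normally not needed.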

Running a search pipeline

With the sequences and the HMM ready, we can finally run the search pipeline. The pipeline has to be initialized with an Alphabet instance, so that the Plan7 background model can be configured accordingly. We then run the pipeline in search mode, providing it with one HMM and several sequences. This method returns a TopHits instance that is already sorted and thresholded.

Note

Using a Pipeline object directly is fine when you only have a single HMM to compare to a sequence database, and when your sequence database is already stored in a supported type (a SequenceFile or a DigitalSequenceBlock). However, if you plan to make many-to-many comparisons between several sequences and several pHMMs, you should use the pyhmmer.hmmer.hmmsearch function, which will take care of setting up the multithreading and let you compute results efficiently on a multi-core machine.

[4]:
pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)
hits = pipeline.search_hmm(hmm, sequences)
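
Before looking at individual alignments, it can help to get a quick overview of what the pipeline found. The sketch below assumes the name, score, evalue and domains attributes of each hit and the target_from and target_to attributes of each alignment, as used elsewhere in this notebook. summarize_hits is a hypothetical helper, and it is demonstrated on stand-in objects with made-up values so that it runs without the data files.

```python
from types import SimpleNamespace

def summarize_hits(hits):
    """Return one (name, score, evalue, [(start, end), ...]) tuple per hit."""
    rows = []
    for hit in hits:
        spans = [
            (domain.alignment.target_from, domain.alignment.target_to)
            for domain in hit.domains
        ]
        rows.append((hit.name.decode(), hit.score, hit.evalue, spans))
    return rows

# Stand-in objects: the score and E-value are placeholders, not real results.
fake_domain = SimpleNamespace(
    alignment=SimpleNamespace(target_from=635, target_to=925)
)
fake_hit = SimpleNamespace(
    name=b"example_protein", score=100.0, evalue=1e-30, domains=[fake_domain]
)

for name, score, evalue, spans in summarize_hits([fake_hit]):
    print(f"{name}\tscore={score:.1f}\tE-value={evalue:.1e}\tdomains={spans}")
```

With the real results, calling summarize_hits(hits) should give one row per hit, with the domain coordinates expressed in the target sequence (1-based, inclusive).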

Rendering the alignments

Domain instances store all the information required to report results in their alignment attribute. We can show the alignment between an HMM and a sequence, as hmmsearch would, as follows (using the first domain of the first hit as an example):

[5]:
ali = hits[0].domains[0].alignment
print(ali)
         PKS-AT.tcoffee   1 lFpGQGsQyaGMGreLYetePVFRqalDrCaaaLrphLgfsLlevLfgdegqeeaaaslLdqTryaQPALFAvEYALArLWrSWGvePdAVlGHSvGEyvAAcvAGVlSLEDALrLVaaRGrLMqa.lpggGaMlaVraseeevrelLapyggrlsiAAvNGPrsvVvSGdaeaieallaeLeaqGirarrLkVsHAFHSplMepmldeleevlagitpraPriPliSnvTGewltgeealdpaYWarhlRePVrFadgletLlaelGctvFlEvGPhpvLtalarrtlgesagtngadaawlaSLrrg 308
                            +FpGQG+Q+aGMG eL++++ VF++a+ +C+aaL+p++++sL +v ++ +g     a+ L++++++QP+ FAv+++LAr W+  Gv+P+AV+GHS+GE++AA+vAG+lSL+DA+r+V  R++ ++a l+g+G+Ml+ ++se+ v e+La+++ +ls+AAvNGP ++VvSGd+ +ie+l++++ea G+rar ++V++A+HS+++e +  el+evlag++p+aPr+P++S++ G+w+t+  +ld++YW+r+lR+ V Fa+++etL+ + G+t+F+Ev++hpvLt ++  t            + la+Lrr+
  sp|Q9ZGI5|PIKA1_STRVZ 635 VFPGQGTQWAGMGAELLDSSAVFAAAMAECEAALSPYVDWSLEAVVRQAPG-----APTLERVDVVQPVTFAVMVSLARVWQHHGVTPQAVVGHSQGEIAAAYVAGALSLDDAARVVTLRSKSIAAhLAGKGGMLSLALSEDAVLERLAGFD-GLSVAAVNGPTATVVSGDPVQIEELARACEADGVRARVIPVDYASHSRQVEIIESELAEVLAGLSPQAPRVPFFSTLEGAWITE-PVLDGGYWYRNLRHRVGFAPAVETLATDEGFTHFVEVSAHPVLTMALPGTV-----------TGLATLRRD 925
                            8*************************************************7.....********************************************************************988***********************96.************************************************************************************.*************************9666***************9988885...........557777775 PP

You may also want to see where the domains are located in the input sequence; using the DNA feature viewer developed by the Edinburgh Genome Foundry, we can build a summary graph aligning the protein sequences to the same reference axis:

[6]:
from dna_features_viewer import GraphicFeature, GraphicRecord
import matplotlib.pyplot as plt

# create an index so we can retrieve a Sequence from its name
seq_index = { seq.name:seq for seq in sequences }

fig, axes = plt.subplots(nrows=len(hits), figsize=(16, 6), sharex=True)
for ax, hit in zip(axes, hits):
    # add one feature per domain
    features = [
        GraphicFeature(start=d.alignment.target_from-1, end=d.alignment.target_to)
        for d in hit.domains
    ]
    length = len(seq_index[hit.name])
    desc = seq_index[hit.name].description.decode()

    # render the feature records
    record = GraphicRecord(sequence_length=length, features=features)
    record.plot(ax=ax)
    ax.set_title(desc)

# make sure everything fits in the final graph!
fig.tight_layout()
../_images/examples_active_site_17_1.svg

Checking individual positions for catalytic activity

First, let’s define a function to iterate over an alignment; this will come in handy later. This function yields the position in the alignment (using the HMM coordinates) and the aligned amino acid, skipping over gaps in the HMM sequence.

[7]:
def iter_target_match(alignment):
    position = alignment.hmm_from
    for hmm_letter, amino_acid in zip(alignment.hmm_sequence, alignment.target_sequence):
        if hmm_letter != ".":
            yield position, amino_acid
            position += 1

Now, for the final step, we want to check the specificity of the enzyme domains. Del Vecchio et al. identified two amino acids in the acyltransferase domain that, when mutated, determine whether the enzyme is specific for malonyl-CoA or methylmalonyl-CoA:

image0

For this, we need to check the alignment produced by HMMER and verify that the residues of the catalytic site correspond to the ones expected by the authors. We use the function we defined previously, first to check that the core amino acids are not mutated, and then to check the specificity given by the two remaining residues.

[8]:
POSITIONS   = [ 93,  94,  95, 120, 196, 198]
EXPECTED    = ['G', 'H', 'S', 'R', 'A', 'H']
SPECIFICITY = [195, 197]

for hit in hits:
    print("\nIn sequence {!r}:".format(hit.name.decode()))
    for domain in hit.domains:
        ali = domain.alignment
        aligned = dict(iter_target_match(ali))

        print("- Found PKSI-AT domain at positions {:4} to {:4}".format(ali.target_from, ali.target_to))
        try:
            signature = [ aligned[x] for x in POSITIONS ]
            spec = [ aligned[x] for x in SPECIFICITY ]
        except KeyError:
            print("  -> Domain likely too short")
            continue
        if signature != EXPECTED:
            print("  -> Substrate specificity unknown")
        elif spec == ["H", "F"]:
            print("  -> Malonyl-CoA specific")
        elif spec == ["Y", "S"]:
            print("  -> Methylmalonyl-CoA specific")
        else:
            print("  -> Neither malonyl-CoA nor methylmalonyl-CoA specific")


In sequence 'sp|Q9ZGI5|PIKA1_STRVZ':
- Found PKSI-AT domain at positions  635 to  925
  -> Methylmalonyl-CoA specific
- Found PKSI-AT domain at positions 1651 to 1927
  -> Methylmalonyl-CoA specific
- Found PKSI-AT domain at positions 3181 to 3475
  -> Malonyl-CoA specific

In sequence 'sp|Q9ZGI2|PIKA4_STRVZ':
- Found PKSI-AT domain at positions  563 to  837
  -> Methylmalonyl-CoA specific

In sequence 'sp|A0A089QRB9|MSL3_MYCTU':
- Found PKSI-AT domain at positions  540 to  834
  -> Neither malonyl-CoA nor methylmalonyl-CoA specific

In sequence 'sp|Q9Y8A5|LOVB_ASPTE':
- Found PKSI-AT domain at positions  562 to  585
  -> Domain likely too short
- Found PKSI-AT domain at positions  651 to  854
  -> Neither malonyl-CoA nor methylmalonyl-CoA specific

In sequence 'sp|Q54FI3|STLB_DICDI':
- Found PKSI-AT domain at positions  625 to  726
  -> Domain likely too short
- Found PKSI-AT domain at positions  766 to  838
  -> Domain likely too short
- Found PKSI-AT domain at positions  880 to  944
  -> Domain likely too short
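
The decision logic of the loop above can also be isolated in a small function, which makes it easier to check against known cases without running a search. This is a sketch rather than part of the original analysis: classify and its return labels are hypothetical names, and the input is assumed to be a dictionary mapping HMM positions to residues, such as dict(iter_target_match(ali)).

```python
POSITIONS   = [93, 94, 95, 120, 196, 198]
EXPECTED    = ["G", "H", "S", "R", "A", "H"]
SPECIFICITY = [195, 197]

def classify(aligned):
    """Classify a domain from a {HMM position: residue} mapping."""
    try:
        signature = [aligned[x] for x in POSITIONS]
        spec = [aligned[x] for x in SPECIFICITY]
    except KeyError:
        # at least one key position is not covered by the alignment
        return "too short"
    if signature != EXPECTED:
        return "unknown"
    if spec == ["H", "F"]:
        return "malonyl-CoA"
    if spec == ["Y", "S"]:
        return "methylmalonyl-CoA"
    return "other"

# Synthetic inputs: an intact core plus each pair of specificity residues.
core = dict(zip(POSITIONS, EXPECTED))
print(classify({**core, 195: "H", 197: "F"}))
print(classify({**core, 195: "Y", 197: "S"}))
print(classify({93: "G"}))
```

Keeping the rules in one function also means the printing loop only has to translate the returned label into a message.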