{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analyse the active site of an enzymatic domain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This example is adapted from the method used by [AntiSMASH](https://antismash.secondarymetabolites.org/#!/about) to annotate biosynthetic gene clusters. AntiSMASH uses profile HMMs to annotate enzymatic domains in protein sequences. By matching the amino acids in the alignment, it can then predict the product specificity of the enzyme. \n",
    "\n",
    "In this notebook, we show how to reproduce this kind of analysis, using a PKSI Acyltransferase domain built by the AntiSMASH authors (the HMM in HMMER2 format can be downloaded from [their git repository](https://github.com/antismash/antismash/blob/master/antismash/modules/active_site_finder/data/PKSI-AT.hmm2)).\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "\n",
    "References\n",
    "    \n",
    "* [Del Vecchio, F., H. Petkovic, S. G. Kendrew, L. Low, B. Wilkinson, R. Lill, J. Cortes, B. A. Rudd, J. Staunton, and P. F. Leadlay. 2003. *Active-site residue, domain and module swaps in modular polyketide synthases.* J Ind. Microbiol Biotechnol 30:489-494.](https://pubmed.ncbi.nlm.nih.gov/12811585/)\n",
    "* [Medema MH, Blin K, Cimermancic P, de Jager V, Zakrzewski P, Fischbach MA,\n",
    "Weber T, Takano E, Breitling R. *antiSMASH: rapid identification, annotation and\n",
    "analysis of secondary metabolite biosynthesis gene clusters in bacterial and\n",
    "fungal genome sequences.* Nucleic Acids Res. 2011 Jul:W339-46](https://pubmed.ncbi.nlm.nih.gov/21672958/).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyhmmer\n",
    "pyhmmer.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the HMM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Loading a HMMER profile is done with the `pyhmmer.plan7.HMMFile` class, which provides an iterator over the HMMs in the file. We can use the `read` method to get the first (and only) `pyhmmer.plan7.HMM` from the file. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with pyhmmer.plan7.HMMFile(\"data/hmms/txt/PKSI-AT.hmm\") as hmm_file:\n",
    "    hmm = hmm_file.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading digitized sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Easel provides the code necessary to load sequences from files in common biological formats, such as GenBank or FASTA. These utilities are wrapped by the `pyhmmer.easel.SequenceFile`, which provides an iterator over the sequences in the file. Note that `SequenceFile` tries to guess the format by default, but you can force a particular format with the `format` keyword argument. Note that we use `digital=True` to instruct the sequence file that we want to load sequences from the file into digital format. The alphabet will be guessed from the sequence content, unless given explicitly with the `alphabet` keyword."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with pyhmmer.easel.SequenceFile(\"data/seqs/PKSI.faa\", digital=True) as seq_file:\n",
    "    sequences = seq_file.read_block()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "Note\n",
    "\n",
    "The C interface of Easel allows storing a sequence in two different modes: in _text_ mode, where the sequence letters are represented as individual characters (e.g. \"A\" or \"Y\"), and _digital_ mode, where sequence letters are encoded as digits. To make Python programs clearer, and to allow static typecheck of the storage mode, we provide two separate classes, `TextSequence` and `DigitalSequence`, that represent a sequence stored in either of these modes. Most functions that perform actual work (such as `pyhmmer.hmmsearch`) will expect a digital sequence.\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running a search pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the sequences and the HMM ready, we can finally run the search pipeline: it has to be initialized with an `Alphabet` instance, so that the Plan7 background model can be configured accordingly. Then, we run the pipeline in search mode, providing it one HMM, and several sequences. This method returns a `TopHits` instance that is already sorted and thresholded.\n",
    "\n",
    "<div class=\"alert alert-info\">\n",
    "    \n",
    "Note\n",
    "    \n",
    "Using a `Pipeline` object directly is fine when you only have a single HMM to compare to a sequence database, and when your sequence database is already stored in a supported type (a `SequenceFile` or a `DigitalSequenceBlock`). However, if you plan to make many-to-many comparisons between several sequences and several pHMMs, you should use the `pyhmmer.hmmer.hmmsearch` function, which will take care of setting up the multithreading and let you compute results efficiently on a multi-core machine.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipeline = pyhmmer.plan7.Pipeline(hmm.alphabet)\n",
    "hits = pipeline.search_hmm(hmm, sequences)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Rendering the alignments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Domain` instances store all the required information to report results in their `alignment` attribute. We can show the alignment between a HMM and a sequence \n",
    "like `hmmsearch` would as follow (using the first domain of the first hit as an example):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ali = hits[0].domains[0].alignment\n",
    "print(ali)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You may also want to see where the domains are located in the input sequence; using the [DNA feature viewer](https://edinburgh-genome-foundry.github.io/DnaFeaturesViewer/) developed by the [Edinburgh Genome Foundry](https://edinburgh-genome-foundry.github.io/), we can build a summary graph aligning the protein sequences to the same reference axis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dna_features_viewer import GraphicFeature, GraphicRecord\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# create an index so we can retrieve a Sequence from its name\n",
    "seq_index = { seq.name:seq for seq in sequences }\n",
    "\n",
    "fig, axes = plt.subplots(nrows=len(hits), figsize=(16, 6), sharex=True)\n",
    "for ax, hit in zip(axes, hits):\n",
    "    # add one feature per domain\n",
    "    features = [\n",
    "        GraphicFeature(start=d.alignment.target_from-1, end=d.alignment.target_to)\n",
    "        for d in hit.domains\n",
    "    ]\n",
    "    length = len(seq_index[hit.name])\n",
    "    desc = seq_index[hit.name].description.decode()\n",
    "    \n",
    "    # render the feature records\n",
    "    record = GraphicRecord(sequence_length=length, features=features)\n",
    "    record.plot(ax=ax)\n",
    "    ax.set_title(desc)\n",
    "    \n",
    "# make sure everything fits in the final graph!\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Checking individual positions for catalytic activity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First let's define a function to iterate over an alignement; this will come in handy later. This function yields the position in the alignment (using the HMM coordinates) and the aligned amino acid, skipping over gaps in the HMM sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def iter_target_match(alignment):\n",
    "    position = alignment.hmm_from\n",
    "    for hmm_letter, amino_acid in zip(alignment.hmm_sequence, alignment.target_sequence):\n",
    "        if hmm_letter != \".\":\n",
    "            yield position, amino_acid      \n",
    "            position += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, for the final step, we want to check for the specificity of the enzyme domains; Del Vecchio *et al.* have identified two amino acids in the acyltransferase domain that once muted will decide of the enzyme specificity for either malonyl-CoA or methylmalonyl-CoA:\n",
    "\n",
    "![](active_site.png)\n",
    "\n",
    "For this, we need to check the alignment produced by HMMER, and verify the residues of the catalytic site correspond to the ones expected by the authors. We use the function we defined previously, first to check the core amino acids are not muted, and then to check the specificity of the two remaining residues."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "POSITIONS   = [ 93,  94,  95, 120, 196, 198]\n",
    "EXPECTED    = ['G', 'H', 'S', 'R', 'A', 'H']\n",
    "SPECIFICITY = [195, 197]\n",
    "\n",
    "for hit in hits:\n",
    "    print(\"\\nIn sequence {!r}:\".format(hit.name.decode()))\n",
    "    for domain in hit.domains:       \n",
    "        ali = domain.alignment\n",
    "        aligned = dict(iter_target_match(ali))\n",
    "        \n",
    "        print(\"- Found PKSI-AT domain at positions {:4} to {:4}\".format(ali.target_from, ali.target_to))\n",
    "        try:\n",
    "            signature = [ aligned[x] for x in POSITIONS ]\n",
    "            spec = [ aligned[x] for x in SPECIFICITY ] \n",
    "        except KeyError:\n",
    "            print(\"  -> Domain likely too short\")\n",
    "            continue\n",
    "        if signature != EXPECTED:\n",
    "            print(\"  -> Substrate specificity unknown\")\n",
    "        elif spec == [\"H\", \"F\"]:\n",
    "            print(\"  -> Malonyl-CoA specific\")\n",
    "        elif spec == [\"Y\", \"S\"]:\n",
    "            print(\"  -> Methylmalonyl-CoA specific\")\n",
    "        else:\n",
    "            print(\"  -> Neither malonyl-CoA nor methylmalonyl-CoA specific\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}