{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multiple Sequence Alignment to HMM" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pyhmmer\n", "pyhmmer.__version__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alphabet = pyhmmer.easel.Alphabet.amino()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the alignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A new HMM can be built from a single sequence, or from a multiple sequence alignment. Let's load an alignment in digital mode so that we can build our HMM:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with pyhmmer.easel.MSAFile(\"data/msa/LuxC.sto\") as msa_file:\n", " msa_file.set_digital(alphabet)\n", " msa = next(msa_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Note\n", " \n", "In this example, we load a multiple sequence alignment from a file, but if your program produces alignment and you wish to produce an HMM out of them, you can instantiate a `DigitalMSA` object yourself, e.g.:\n", " \n", "```python\n", "seq1 = pyhmmer.easel.TextSequence(name=\"seq1\", sequence=\"WVPKQDFT\")\n", "seq2 = pyhmmer.easel.TextSequence(name=\"seq2\", sequence=\"WL--PQGE\")\n", "msa = pyhmmer.easel.DigitalMSA(name=\"msa\", sequences=[seq1, seq2])\n", "```\n", "\n", "Because we need a `DigitalMSA` to build the HMM, you will have to convert it first:\n", " \n", "```python\n", "msa_d = msa.digitize(alphabet) \n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building an HMM\n", "\n", "Now that we have a multiple alignment loaded in memory, we can build a pHMM using a `pyhmmer.plan7.Builder`. This also requires a Plan7 background model to compute the transition probabilities." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "builder = pyhmmer.plan7.Builder(alphabet)\n", "background = pyhmmer.plan7.Background(alphabet)\n", "hmm, _, _ = builder.build_msa(msa, background)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can have a look at the consensus sequence of the HMM with the `consensus` property:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hmm.consensus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving the resulting HMM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have an HMM, we can save it to a file to avoid having to rebuild it every time. Using the `HMM.write` method lets us write the HMM in ASCII format to an arbitrary file. The resulting file will also be compatible with the `hmmsearch` binary if you wish to use that one instead of PyHMMER." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"data/hmms/txt/LuxC.hmm\", \"wb\") as output_file:\n", " hmm.write(output_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applying the HMM to a sequence database" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once a pHMM has been obtained, it can be applied to a sequence database with the `pyhmmer.plan7.Pipeline` object. Let's iterate over the protein sequences in a FASTA to see if our new HMM gets any hits:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background, report_e=1e-5)\n", "\n", "with pyhmmer.easel.SequenceFile(\"data/seqs/LuxC.faa\") as seq_file:\n", " seq_file.set_digital(alphabet)\n", " hits = pipeline.search_hmm(query=hmm, sequences=seq_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then query the `TopHits` object to access the domain hits in the sequences:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ali = hits[0].domains[0].alignment\n", "\n", "print(\" \"*3, ali.target_name.decode())\n", "print(\"{:3}\".format(ali.hmm_from), ali.hmm_sequence[:80] + \"...\")\n", "print(\" \"*3, ali.identity_sequence[:80] + \"...\")\n", "print(\"{:3}\".format(ali.target_from), ali.target_sequence[:80] + \"...\")\n", "print(\" \"*3, ali.hmm_name.decode())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" } }, "nbformat": 4, "nbformat_minor": 4 }