{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Build an HMM from a multiple sequence alignment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pyhmmer\n", "pyhmmer.__version__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alphabet = pyhmmer.easel.Alphabet.amino()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the alignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A new HMM can be built from a single sequence, or from a multiple sequence alignment. Let's load an alignment in digital mode so that we can build our HMM. We can use a `MSAFile` to read an alignment from a file in one of the supported formats (such as [Stockholm](https://www.tbi.univie.ac.at/RNA/ViennaRNA/doc/html/file_formats.html#msa-formats-stockholm), [Clustal](https://www.tbi.univie.ac.at/RNA/ViennaRNA/doc/html/file_formats.html#msa-formats-clustal), [aligned FASTA](https://www.tbi.univie.ac.at/RNA/ViennaRNA/doc/html/file_formats.html#msa-formats-fasta), *etc*; see the `MSAFile` documentation for the complete list):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with pyhmmer.easel.MSAFile(\"data/msa/LuxC.sto\", digital=True, alphabet=alphabet) as msa_file:\n", " msa = msa_file.read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Note\n", " \n", "In this example, we load a multiple sequence alignment from a file, but if your program produces alignment and you wish to make an HMM out of them, you can instantiate a `TextMSA` object yourself, e.g.:\n", " \n", "```python\n", "seq1 = pyhmmer.easel.TextSequence(name=b\"seq1\", sequence=\"WVPKQDFT\")\n", "seq2 = pyhmmer.easel.TextSequence(name=b\"seq2\", sequence=\"WL--PQGE\")\n", "msa = pyhmmer.easel.TextMSA(name=b\"msa\", sequences=[seq1, seq2])\n", "```\n", "\n", "Because we need a `DigitalMSA` to build the HMM, you will have to convert it first:\n", " \n", "```python\n", "msa_d = msa.digitize(alphabet) \n", "```\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building an HMM\n", "\n", "Now that we have a multiple alignment loaded in memory, we can build a pHMM using a `pyhmmer.plan7.Builder`. This also requires a Plan7 background model to compute the transition probabilities." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "builder = pyhmmer.plan7.Builder(alphabet)\n", "background = pyhmmer.plan7.Background(alphabet)\n", "hmm, _, _ = builder.build_msa(msa, background)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "Warning\n", " \n", "HMMs must always have a name. The `Builder` will attempt to use the `MSA` name, but in the event your alignment file does not contain a name for the whole alignment (for instance, a `#=GF ID` line in a [Stockholm file](https://en.wikipedia.org/wiki/Stockholm_format)), this method will fail:\n", " \n", "```python\n", ">>> anonymous_msa = pyhmmer.easel.TextMSA(sequences=[seq1, seq2])\n", ">>> builder.build_msa(anonymous_msa)\n", "Traceback (most recent call last):\n", " File \"\", line 1, in \n", "ValueError: Could not build HMM: Unable to name the HMM.\n", "``` \n", "\n", "If your alignment is missing a name, you can assign one manually by setting the `MSA.name` property, e.g.:\n", "\n", "```python\n", "msa.name = b\"protein_family_01\"\n", "```\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can have a look at the consensus sequence of the HMM with the `consensus` property:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hmm.consensus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving the resulting HMM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have an HMM, we can save it to a file to avoid having to rebuild it every time. Using the `HMM.write` method lets us write the HMM in ASCII or binary format to an arbitrary file. The resulting file will also be compatible with the `hmmsearch` binary if you wish to use that one instead of PyHMMER." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"LuxC.hmm\", \"wb\") as output_file:\n", " hmm.write(output_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applying the HMM to a sequence database" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once a pHMM has been obtained, it can be applied to a sequence database with the `pyhmmer.plan7.Pipeline` object. Let's iterate over the protein sequences in a FASTA to see if our new HMM gets any hits:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline = pyhmmer.plan7.Pipeline(alphabet, background=background)\n", "with pyhmmer.easel.SequenceFile(\"data/seqs/LuxC.faa\", digital=True, alphabet=alphabet) as seq_file:\n", " hits = pipeline.search_hmm(hmm, seq_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then query the `TopHits` object to access the domain hits in the sequences:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ali = hits[0].domains[0].alignment\n", "print(ali)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we can save the results in tabular format similar to what `hmmsearch` would have produced:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open(\"LuxC.domtbl\", \"wb\") as f:\n", " hits.write(f, format=\"domains\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" } }, "nbformat": 4, "nbformat_minor": 4 }