Fetch marker genes from a genome

This example is adapted from the fetchMGs Perl script used to extract the 40 single-copy universal marker genes from a genome, annotating proteins to find the highest scoring proteins mapping to each of these marker genes.

In this notebook, we show how to reproduce this kind of analysis, using pyhmmer instead of HMMER3 to perform the alignments and extract the bit scores.

import pyhmmer

Getting the cutoffs

Each HMM has been calibrated and contains custom cutoffs, but they are not in Pfam format, so we need to use them externaly. Let’s start by downloading the file with these cutoffs from the GitHub repository of fetchMG:

import csv
import io
import urllib.request

url = "https://github.com/motu-tool/fetchMGs/raw/master/lib/MG_BitScoreCutoffs.allhits.txt"

cutoffs = {}
with urllib.request.urlopen(url) as f:
    for line in csv.reader(io.TextIOWrapper(f), dialect="excel-tab"):
        if not line[0].startswith("#"):
            cutoffs[line[0]] = float(line[1])

Downloading the HMMs

Since the HMMs for the universal marker genes are also hosted on the fetchMG GitHub repository, we can download them from there too. pyhmmer.plan7.HMMFile supports reading from a file-handle, so we can parse each HMM as we download it. We also use the occasion to update the bitscore cutoffs specific to each HMM that we just obtained in the previous step.

import urllib.request
import pyhmmer.plan7

baseurl = "https://github.com/motu-tool/fetchMGs/raw/master/lib/{}.hmm"

hmms = []
for cog in cutoffs:
    with urllib.request.urlopen(baseurl.format(cog)) as f:
        hmm = next(pyhmmer.plan7.HMMFile(f))
        cutoff = cutoffs[hmm.name.decode()]
        hmm.cutoffs.trusted = (cutoff, cutoff)

Loading the sequences

Now we need protein sequences to annotate. Let’s use our set of protein sequences identified in the chromosome of Anaerococcus provencensis.

import pyhmmer.easel
with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs_file:
    proteins = list(seqs_file)

Running the search pipeline

With the proteins loaded, let’s make a namedtuple that will contain the data we need to extract for each hmmsearch hit: the name of the query gene, the name of the marker gene which produced a hit, and the bitscore for the alignment.

import collections
Result = collections.namedtuple("Result", ["query", "cog", "bitscore"])

Now we can run the search pipeline: we annotate all the proteins will all the HMMs using the pyhmmer.hmmsearch function. Each HMM gives us a TopHits instance to process. Note that we run the search pipeline using the trusted bitscore cutoffs so that the cutoffs that we manually set on each HMMs are used to filter the results.

results = []
for top_hits in pyhmmer.hmmsearch(hmms, proteins, bit_cutoffs="trusted"):
    for hit in top_hits:
        cog = hit.best_domain.alignment.hmm_name.decode()
        results.append(Result(hit.name.decode(), cog, hit.score))

Filtering results

Now that we have all the hits that pass the bitscore thresholds, we can create a dictionary that maps each query protein to its highest scoring bitscore alignment, like in the original script. If a protein has alignments to two different marker genes with the same score, that query is ignored.

best_results = {}
keep_query = set()
for result in results:
    if result.query in best_results:
        previous_bitscore = best_results[result.query].bitscore
        if result.bitscore > previous_bitscore:
            best_results[result.query] = result
        elif result.bitscore == previous_bitscore:
            if best_results[result.query].cog != hit.cog:
        best_results[result.query] = result

Now we can get our final filtered results:

filtered_results = [best_results[k] for k in sorted(best_results) if k in keep_query]

We can print them to see which gene maps to which marker gene, with the score for each alignment. Let’s look at the top 10 results:

for result in filtered_results[:10]:
    print(result.query, "{:.1f}".format(result.bitscore), result.cog, sep="\t")
938293.PRJEB85.HG003685_102     1185.1  COG0495
938293.PRJEB85.HG003685_250     690.5   COG0215
938293.PRJEB85.HG003685_381     280.8   COG0087
938293.PRJEB85.HG003685_382     308.7   COG0088
938293.PRJEB85.HG003685_384     438.1   COG0090
938293.PRJEB85.HG003685_385     172.8   COG0185
938293.PRJEB85.HG003685_386     175.5   COG0091
938293.PRJEB85.HG003685_387     385.3   COG0092
938293.PRJEB85.HG003685_388     220.6   COG0197
938293.PRJEB85.HG003685_390     142.3   COG0186

All set! You can now use this list of identifiers to filter the initial protein array, to write proteins grouped by marker genes in a FASTA file, compute alignment against the mOTUs database, or just save it for future use.