Sequence#

class pyhmmer.easel.Sequence#

An abstract biological sequence with some associated metadata.

Easel provides two different modes to store a sequence: text or digital. In the HMMER code, switching from one mode to the other is done in place, which allows recycling memory. However, this can be confusing, since there is no way to know the representation of a sequence statically.

To avoid this, pyhmmer provides two subclasses of the abstract Sequence class that maintain the mode contract: TextSequence and DigitalSequence. Functions expecting sequences in digital format, like pyhmmer.hmmer.hmmsearch, can then use the Python type system to make sure they receive sequences in the right mode. This allows type checkers such as mypy to detect potential contract breaches statically, before the code runs.
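The mode contract can be pictured with a small pure-Python sketch. The classes `Seq`, `TextSeq` and `DigitalSeq` and the function `search` are hypothetical stand-ins written for illustration, not pyhmmer's own classes; the sketch only shows how one subclass per mode lets a function state which mode it accepts.

```python
from abc import ABC
from typing import List


class Seq(ABC):
    """Hypothetical stand-in for an abstract sequence with metadata."""

    def __init__(self, name: str) -> None:
        self.name = name


class TextSeq(Seq):
    """Text mode: residues stored as letters."""

    def __init__(self, name: str, letters: str) -> None:
        super().__init__(name)
        self.letters = letters


class DigitalSeq(Seq):
    """Digital mode: residues stored as small integer codes."""

    def __init__(self, name: str, digits: bytes) -> None:
        super().__init__(name)
        self.digits = digits


def search(targets: List[DigitalSeq]) -> int:
    """Hypothetical consumer that only accepts digital sequences."""
    for target in targets:
        # Runtime guard mirroring what a static type checker would flag.
        if not isinstance(target, DigitalSeq):
            raise TypeError(f"expected DigitalSeq, got {type(target).__name__}")
    return len(targets)


digital = DigitalSeq("seq1", bytes([0, 1, 2, 3]))
text = TextSeq("seq2", "ACGT")
n_ok = search([digital])
# search([text]) is rejected by a type checker and raises TypeError at runtime.
```

Because the mode is part of the type, a call such as `search([text])` can be caught by a type checker without running the program.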

checksum()#: Calculate a 32-bit checksum for the sequence.

clear()#: Reinitialize the sequence for re-use.

copy()#: Duplicate the sequence, and return the copy.

write(fh)#

Write the sequence to a file handle, in FASTA format.

Arguments:: fh (io.IOBase) – A Python file handle, opened in binary mode.

Added in version 0.3.0.
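To illustrate why the handle must be opened in binary mode, here is a minimal sketch of a FASTA writer that emits bytes. The function `write_fasta` and its `width` parameter are hypothetical and are not part of pyhmmer; the 60-column line width is an assumption of this sketch.

```python
import io


def write_fasta(fh, name, sequence, description="", width=60):
    """Write one FASTA record as bytes; `width` is an assumed line width."""
    header = f">{name} {description}".rstrip()
    fh.write(header.encode("ascii") + b"\n")
    # Wrap the residues over fixed-width lines.
    for start in range(0, len(sequence), width):
        fh.write(sequence[start:start + width].encode("ascii") + b"\n")


buffer = io.BytesIO()  # any binary handle works, e.g. open(path, "wb")
write_fasta(buffer, "seq1", "ATGC" * 20, description="example record")
record = buffer.getvalue()
```

A writer of this kind produces bytes, so a handle opened in text mode would reject the data.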

L#

The length of the sequence.

Added in version 0.12.1.

Type:: int

accession#

The accession of the sequence.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

description#

The description of the sequence.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

name#

The name of the sequence.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

residue_markups#

Extra residue markups, mapping information to each position.

Keys and values are not decoded, since they are not necessarily valid UTF-8 bytestrings.

Caution

The values of the dictionary must be the same size as the sequence itself. Trying to set a residue markup of the wrong length will raise a ValueError:

>>> seq = TextSequence(sequence="TTAATTGGT")
>>> seq.residue_markups = {"quality": "efcfffffcfee"}
Traceback (most recent call last):
  ...
ValueError: Residue markup annotation has an invalid length (expected 9, got 12)

Added in version 0.4.6.

Type:: dict

source#

The source of the sequence, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

class pyhmmer.easel.TextSequence#

A biological sequence stored in text mode.

Hint

Use the sequence property to access the sequence letters as a Python string.

Added in version 0.10.4: pickle protocol support.

sample(max_length, randomness=None)#

Sample a sequence of length at most max_length at random.

Parameters:

max_length (int) – The maximum length of the sequence to generate (the actual sequence length is sampled).
randomness (Randomness, int or None) – The random number generator to use for sampling, or a seed to initialize a generator. If None or 0 given, create a new random number generator with a random seed.

Returns:

TextSequence – A new text sequence generated at random.

Hint

This constructor is only useful for testing. It should not be used to generate random sequences to e.g. compute a background distribution for a statistical method, since it samples alphabet residues at random irrespective of prior frequencies.

Added in version 0.11.1.
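The sampling behaviour described above can be sketched in pure Python. The function `sample_letters` is hypothetical, and both the length range and the uniform choice of residues are assumptions of this sketch rather than a description of pyhmmer's internals.

```python
import random


def sample_letters(max_length, seed=None, letters="ACGT"):
    """Draw a length in [0, max_length], then residues uniformly at random."""
    rng = random.Random(seed)  # seed=None gives a randomly seeded generator
    length = rng.randint(0, max_length)
    # Every letter is equally likely: no prior residue frequencies are used.
    return "".join(rng.choice(letters) for _ in range(length))


first = sample_letters(50, seed=42)
second = sample_letters(50, seed=42)  # same seed, same sequence
```

Since every residue is equally likely here, the composition of such sequences does not resemble real biological data, which is why they are suited to tests only.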

__init__(*, name=None, description=None, accession=None, sequence=None, source=None, residue_markups=None)#: Create a new text-mode sequence with the given attributes.

Added in version 0.10.4: The residue_markups argument.

Deprecated since version 0.11.1: Passing positional arguments to constructor.

checksum()#: Calculate a 32-bit checksum for the sequence.

clear()#: Reinitialize the sequence for re-use.

copy()#: Duplicate the text sequence, and return the copy.

digitize(alphabet)#

Convert the text sequence to a digital sequence using alphabet.

Returns:: DigitalSequence – A copy of the sequence in digital mode, digitized with alphabet.
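Conceptually, digitizing maps each letter to an integer code defined by the alphabet. The sketch below uses a hypothetical four-letter alphabet and hypothetical helper functions `digitize` and `textize`; the real Easel alphabets contain more symbols, and this is not pyhmmer's implementation.

```python
def digitize(letters, symbols="ACGT"):
    """Map each letter to its index in `symbols` (a hypothetical alphabet)."""
    codes = {symbol: index for index, symbol in enumerate(symbols)}
    try:
        return bytes(codes[letter] for letter in letters.upper())
    except KeyError as err:
        raise ValueError(f"letter {err.args[0]!r} is not in the alphabet") from None


def textize(digits, symbols="ACGT"):
    """Inverse mapping: turn integer codes back into letters."""
    return "".join(symbols[digit] for digit in digits)


digits = digitize("ATGC")
letters = textize(digits)
```

The round trip returns the original letters, which is the relationship between digitize on a text sequence and textize on a digital one.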

reverse_complement(inplace=False)#

Build the reverse complement of the sequence.

This method assumes that the sequence alphabet is IUPAC/DNA. If the sequence contains any unknown letters, they will be replaced by N in the reverse complement.

Parameters:: inplace (bool) – Whether to reverse-complement the sequence in place instead of working on a copy. With False (the default), the method returns a reverse-complemented copy of the sequence. With True, it reverse-complements the sequence in place and returns None.
Raises:: UserWarning – When the sequence contains unknown characters.

Example

>>> seq = TextSequence(sequence="ATGC")
>>> seq.reverse_complement().sequence
'GCAT'

Caution

The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!

Added in version 0.3.0.
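The behaviour on unknown letters can be sketched in pure Python with the IUPAC complement pairs. The function `reverse_complement` below is a hypothetical stand-alone helper, not the pyhmmer method; upper-casing the input is a simplification of this sketch.

```python
import warnings

IUPAC_DNA = "ACGTRYMKSWHBVDN"
# Each IUPAC code is paired with the code of its complementary base set.
COMPLEMENT = str.maketrans(IUPAC_DNA, "TGCAYRKMSWDVBHN")


def reverse_complement(letters):
    """Reverse-complement a DNA string, replacing unknown letters by N."""
    upper = letters.upper()
    unknown = sorted(set(upper) - set(IUPAC_DNA))
    if unknown:
        warnings.warn(f"unknown letters replaced by N: {''.join(unknown)}", UserWarning)
    cleaned = "".join(c if c in IUPAC_DNA else "N" for c in upper)
    return cleaned.translate(COMPLEMENT)[::-1]


result = reverse_complement("ATGC")
```

With this sketch, `"ATGC"` gives `"GCAT"` as in the example above, while an input such as `"ATXC"` gives `"GNAT"` and emits a UserWarning.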

write(fh)#

Write the sequence to a file handle, in FASTA format.

Arguments:: fh (io.IOBase) – A Python file handle, opened in binary mode.

Added in version 0.3.0.

L#

The length of the sequence.

Added in version 0.12.1.

Type:: int

accession#

The accession of the sequence.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

description#

The description of the sequence.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

name#

The name of the sequence.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

residue_markups#

Extra residue markups, mapping information to each position.

Keys and values are not decoded, since they are not necessarily valid UTF-8 bytestrings.

Caution

The values of the dictionary must be the same size as the sequence itself. Trying to set a residue markup of the wrong length will raise a ValueError:

>>> seq = TextSequence(sequence="TTAATTGGT")
>>> seq.residue_markups = {"quality": "efcfffffcfee"}
Traceback (most recent call last):
  ...
ValueError: Residue markup annotation has an invalid length (expected 9, got 12)

Added in version 0.4.6.

Type:: dict

sequence#

The raw sequence letters, as a Python string.

Type:: str

source#

The source of the sequence, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

class pyhmmer.easel.DigitalSequence#

A biological sequence stored in digital mode.

alphabet#

The biological alphabet used to encode this sequence to digits.

Type:: Alphabet, readonly

Hint

Use the sequence property to access the sequence digits as a byte vector, which gives access to the individual bytes. This can be combined with numpy.asarray to get the sequence as an array without copying.

Added in version 0.10.4: pickle protocol support.

sample(alphabet, max_length, randomness=None)#

Sample a sequence of length at most max_length at random.

Parameters:

alphabet (Alphabet) – The alphabet of the sequence.
max_length (int) – The maximum length of the sequence to generate (the actual sequence length is sampled).
randomness (Randomness, int or None) – The random number generator to use for sampling, or a seed to initialize a generator. If None or 0 given, create a new random number generator with a random seed.

Returns:

DigitalSequence – A new digital sequence generated at random, including degenerate symbols.

Hint

This constructor is only useful for testing. It should not be used to generate random sequences to e.g. compute a background distribution for a statistical method, since it samples alphabet residues at random irrespective of prior frequencies.

Added in version 0.11.1.

__init__(alphabet, *, name=None, description=None, accession=None, sequence=None, source=None, residue_markups=None)#

Create a new digital-mode sequence with the given attributes.

Raises:: ValueError – When sequence contains digits outside the alphabet symbol range.

Added in version 0.1.4.

Added in version 0.10.4: The residue_markups argument.

Deprecated since version 0.11.1: Passing positional arguments other than alphabet.

checksum()#: Calculate a 32-bit checksum for the sequence.

clear()#: Reinitialize the sequence for re-use.

copy()#: Duplicate the digital sequence, and return the copy.

reverse_complement(inplace=False)#

Build the reverse complement of the sequence.

Parameters:: inplace (bool) – Whether to reverse-complement the sequence in place instead of working on a copy. With False (the default), the method returns a reverse-complemented copy of the sequence. With True, it reverse-complements the sequence in place and returns None.
Raises:: ValueError – When the alphabet of the DigitalSequence does not have a complement mapping set (e.g., Alphabet.amino).

Caution

The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!

Added in version 0.3.0.

textize()#

Convert the digital sequence to a text sequence.

Returns:: TextSequence – A copy of the sequence in text-mode.

Added in version 0.1.4.

translate(genetic_code=GeneticCode(1))#

Translate the sequence using the given genetic code.

Parameters:

genetic_code (GeneticCode) – The genetic code to use for translating the sequence. If none is provided, the standard translation table (1) is used, which expects DNA sequences.

Returns:

DigitalSequence – The translation of the input sequence, in digital mode.

Raises:

pyhmmer.errors.AlphabetMismatch – When the genetic_code expects a different nucleotide alphabet than the one currently in use to encode the sequence.
ValueError – When the sequence could not be translated properly, because a codon could not be recognized or because the sequence has an invalid length.

Note

The translation of a DNA/RNA codon supports ambiguous codons. If the amino acid is unambiguous, despite codon ambiguity, the correct amino acid is still determined: GGR translates as Gly, UUY as Phe, etc. If there is no single unambiguous amino acid translation, the codon is translated as X. Ambiguous amino acids (such as J or B) are never produced.

Added in version 0.7.2.
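The rule for ambiguous codons in the note above can be sketched in pure Python: expand each ambiguous base into the bases it stands for, translate every resulting codon, and keep the amino acid only if all of them agree. The helpers `translate_codon` and `translate` are hypothetical and work on plain strings with the standard table (1); they are not the pyhmmer implementation, and their error handling is an assumption of this sketch.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes and the bases they stand for (U is treated as T).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def translate_codon(codon):
    """Translate one codon; return X when the amino acid is ambiguous."""
    try:
        choices = [IUPAC[base] for base in codon.upper()]
    except KeyError:
        raise ValueError(f"unrecognized codon: {codon!r}") from None
    amino_acids = {CODON_TABLE["".join(c)] for c in product(*choices)}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


def translate(sequence):
    """Translate a nucleotide string whose length is a multiple of 3."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(translate_codon(sequence[i:i + 3]) for i in range(0, len(sequence), 3))


protein = translate("ATGGGRUUYRAY")
```

Here GGR and UUY each resolve to a single amino acid, while RAY could encode either Asn or Asp and is therefore reported as X rather than B.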

write(fh)#

Write the sequence to a file handle, in FASTA format.

Arguments:: fh (io.IOBase) – A Python file handle, opened in binary mode.

Added in version 0.3.0.

L#

The length of the sequence.

Added in version 0.12.1.

Type:: int

accession#

The accession of the sequence.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

description#

The description of the sequence.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

name#

The name of the sequence.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str

residue_markups#

Extra residue markups, mapping information to each position.

Keys and values are not decoded, since they are not necessarily valid UTF-8 bytestrings.

Caution

The values of the dictionary must be the same size as the sequence itself. Trying to set a residue markup of the wrong length will raise a ValueError:

>>> seq = TextSequence(sequence="TTAATTGGT")
>>> seq.residue_markups = {"quality": "efcfffffcfee"}
Traceback (most recent call last):
  ...
ValueError: Residue markup annotation has an invalid length (expected 9, got 12)

Added in version 0.4.6.

Type:: dict

sequence#

The raw sequence digits, as a byte vector.

Note

The internal ESL_SQ object allocates a buffer of size \(n+2\) (where \(n\) is the number of residues in the sequence), with the first and the last element of the buffer being sentinel values. This vector does not expose the sentinel values, only the \(n\) elements of the buffer in between.
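The buffer layout described in this note can be pictured with a short sketch using only built-in types. The sentinel value 255 and the helper `make_buffer` are assumptions made for illustration; the sketch models the layout, not the actual ESL_SQ structure.

```python
SENTINEL = 255  # assumed sentinel value for this sketch


def make_buffer(digits):
    """Build a buffer of n + 2 bytes: sentinel, n residues, sentinel."""
    return bytearray([SENTINEL, *digits, SENTINEL])


buffer = make_buffer([0, 3, 2, 1])
# Zero-copy view over the n residues only, hiding both sentinels.
residues = memoryview(buffer)[1:-1]
residues[0] = 2  # writes through to the underlying buffer
```

The view has length n and shares memory with the buffer, so no residue is copied and the sentinels stay out of reach.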

Changed in version 0.4.0: Property is now a VectorU8 instead of memoryview.

Type:: VectorU8

source#

The source of the sequence, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:: str