Sequence#
- class pyhmmer.easel.Sequence#
An abstract biological sequence with some associated metadata.
Easel provides two different modes to store a sequence: text or digital. In the HMMER code, switching from one mode to the other is done in place, which allows recycling memory. However, doing so can be confusing, since there is no way to know statically the representation of a sequence.
To avoid this, pyhmmer provides two subclasses of the Sequence abstract class to maintain the mode contract: TextSequence and DigitalSequence. Functions expecting sequences in digital format, like pyhmmer.hmmer.hmmsearch, can then use the Python type system to make sure they receive sequences in the right mode. This allows type checkers such as mypy to detect potential contract breaches statically, before the code runs.
- __copy__()#
None
- __sizeof__()#
None
- checksum()#
Calculate a 32-bit checksum for the sequence.
- clear()#
Reinitialize the sequence for re-use.
- copy()#
Duplicate the sequence, and return the copy.
- write(fh)#
Write the sequence to a file handle, in FASTA format.
- Parameters:
fh (io.IOBase) – A Python file handle, opened in binary mode.
Added in version 0.3.0.
- residue_markups#
Extra residue markups, mapping information to each position.
Keys and values are not decoded, since they are not necessarily valid UTF-8 bytestrings.
Caution
The values of the dictionary must be the same size as the sequence itself. Trying to set a residue markup of the wrong length will raise a ValueError:
>>> seq = TextSequence(sequence="TTAATTGGT")
>>> seq.residue_markups = {b"quality": b"efcfffffcfee"}
Traceback (most recent call last):
...
ValueError: Residue markup annotation has an invalid length (expected 9, got 12)
Added in version 0.4.6.
- Type:
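The mode contract described for Sequence can be illustrated with a minimal, standalone sketch. The classes and the search function below are hypothetical stand-ins written for this example, not pyhmmer's own implementation; they only show how two subclasses let a type checker (and a runtime check) tell the modes apart.

```python
from dataclasses import dataclass


@dataclass
class Sequence:
    """Hypothetical stand-in for the abstract base class."""
    name: bytes = b""


@dataclass
class TextSequence(Sequence):
    """Text mode: residues stored as letters."""
    sequence: str = ""


@dataclass
class DigitalSequence(Sequence):
    """Digital mode: residues stored as alphabet indices."""
    sequence: bytes = b""


def search(query: DigitalSequence) -> int:
    """Hypothetical consumer that only accepts digital-mode sequences."""
    if not isinstance(query, DigitalSequence):
        raise TypeError(f"expected DigitalSequence, got {type(query).__name__}")
    return len(query.sequence)


digital = DigitalSequence(name=b"seq1", sequence=bytes([0, 1, 2, 3]))
text = TextSequence(name=b"seq1", sequence="ACGT")

# Accepted both by a static type checker and at runtime.
n_residues = search(digital)
# search(text) would be reported by mypy as an incompatible argument type,
# and raises TypeError at runtime in this sketch.
```

Because the mode is part of the type, a mismatch shows up where the function is called rather than deep inside the code that consumes the sequence.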
- class pyhmmer.easel.TextSequence#
A biological sequence stored in text mode.
Hint
Use the sequence property to access the sequence letters as a Python string.
Added in version 0.10.4: pickle protocol support.
- __init__(name=None, description=None, accession=None, sequence=None, source=None, residue_markups=None)#
Create a new text-mode sequence with the given attributes.
Added in version 0.10.4: The residue_markups argument.
- __reduce__()#
None
- copy()#
Duplicate the text sequence, and return the copy.
- digitize(alphabet)#
Convert the text sequence to a digital sequence using alphabet.
- Returns:
DigitalSequence – A copy of the sequence in digital mode, digitized with alphabet.
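Conceptually, digitizing maps each letter to its index in the alphabet, and the reverse conversion (textize on a DigitalSequence) maps indices back to letters. The sketch below illustrates that idea in plain Python; the four-letter symbol string, the helper names, and the ValueError on unknown letters are assumptions made for this example and do not reflect the actual Easel alphabet layout or error handling.

```python
def digitize(text: str, symbols: str = "ACGT") -> bytes:
    """Map each letter to its index in `symbols` (assumed toy alphabet)."""
    index = {letter: digit for digit, letter in enumerate(symbols)}
    try:
        return bytes(index[letter] for letter in text.upper())
    except KeyError as err:
        # Assumption of this sketch: unknown letters are rejected.
        raise ValueError(f"symbol {err.args[0]!r} is not in the alphabet") from None


def textize(digits: bytes, symbols: str = "ACGT") -> str:
    """Map each index back to the corresponding letter."""
    return "".join(symbols[d] for d in digits)


digits = digitize("TTAATTGGT")
roundtrip = textize(digits)
```

Both helpers return new objects, mirroring the fact that digitize and textize return copies rather than converting the sequence in place.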
- reverse_complement(inplace=False)#
Build the reverse complement of the sequence.
This method assumes that the sequence alphabet is IUPAC/DNA. If the sequence contains any unknown letters, they will be replaced by \(N\) in the reverse-complement.
- Parameters:
inplace (bool) – Whether to reverse-complement the sequence in place instead of working on a copy. With False (the default), the method will return a copy of the sequence that has been reverse-complemented. With True, it will reverse-complement in place and return None.
- Raises:
UserWarning – When the sequence contains unknown characters.
Example
>>> seq = TextSequence(sequence="ATGC")
>>> seq.reverse_complement().sequence
'GCAT'
Caution
The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!
Added in version 0.3.0.
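The handling of unknown letters described for the text-mode reverse complement can be sketched in plain Python. This is an illustration written for this page, not the library's code: the helper name is hypothetical, only uppercase IUPAC nucleotide codes are treated as known, and any other character is replaced by N with a UserWarning.

```python
import warnings

# IUPAC nucleotide codes and their complements (uppercase only in this sketch).
_KNOWN = "ACGTRYMKSWHBVDN"
_COMPLEMENT = str.maketrans(_KNOWN, "TGCAYRKMSWDVBHN")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement, replacing unknown letters by N."""
    unknown = set(sequence) - set(_KNOWN)
    if unknown:
        warnings.warn(
            f"unknown characters replaced by N: {sorted(unknown)}", UserWarning
        )
    cleaned = "".join(c if c in _KNOWN else "N" for c in sequence)
    # Complement every letter, then reverse the whole string.
    return cleaned.translate(_COMPLEMENT)[::-1]


result = reverse_complement("ATGC")
```

With this sketch, the input "ATGC" gives the same 'GCAT' result as the documented example, and an input such as "ATZC" gives "GNAT" while emitting a warning.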
- class pyhmmer.easel.DigitalSequence#
A biological sequence stored in digital mode.
Hint
Use the sequence property to access the sequence digits as a memory view, which gives access to the individual bytes. This can be combined with numpy.asarray to get the sequence as an array without copying.
Added in version 0.10.4: pickle protocol support.
- __init__(alphabet, name=None, description=None, accession=None, sequence=None, source=None, residue_markups=None)#
Create a new digital-mode sequence with the given attributes.
- Raises:
ValueError – When sequence contains digits outside the alphabet symbol range.
Added in version 0.1.4.
Added in version 0.10.4: The residue_markups argument.
- __reduce__()#
None
- copy()#
Duplicate the digital sequence, and return the copy.
- reverse_complement(inplace=False)#
Build the reverse complement of the sequence.
- Parameters:
inplace (bool) – Whether to reverse-complement the sequence in place instead of working on a copy. With False (the default), the method will return a copy of the sequence that has been reverse-complemented. With True, it will reverse-complement in place and return None.
- Raises:
ValueError – When the alphabet of the DigitalSequence does not have a complement mapping set (e.g., Alphabet.amino).
Caution
The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!
Added in version 0.3.0.
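The inplace semantics and the duplicate-name caveat can be sketched with a small stand-in class. Everything here is hypothetical (the DigitalSeq class, the 0 to 3 encoding for A, C, G, T, and the complement table); it only mirrors the documented behaviour: a copy is returned by default, None is returned when working in place, and a ValueError is raised when no complement mapping exists.

```python
from typing import Optional, Sequence as Seq


class DigitalSeq:
    """Hypothetical stand-in for a digital-mode sequence."""

    def __init__(self, name: bytes, digits: Seq[int],
                 complement: Optional[bytes] = None):
        self.name = name
        self.digits = bytearray(digits)
        # complement[d] is the complement of digit d, or None if undefined.
        self.complement = complement

    def reverse_complement(self, inplace: bool = False) -> Optional["DigitalSeq"]:
        if self.complement is None:
            raise ValueError("alphabet has no complement mapping")
        target = self if inplace else DigitalSeq(self.name, self.digits, self.complement)
        target.digits = bytearray(self.complement[d] for d in reversed(self.digits))
        return None if inplace else target


# Assumed toy encoding: A=0, C=1, G=2, T=3, so the complement of d is 3 - d.
DNA_COMPLEMENT = bytes([3, 2, 1, 0])

seq = DigitalSeq(b"seq1", [0, 3, 2, 1], DNA_COMPLEMENT)  # ATGC
rc = seq.reverse_complement()          # new object, original untouched
same_name = rc.name == seq.name        # the copy keeps the original name
rc.name = seq.name + b"_rc"            # rename to avoid duplicate identifiers
returned = seq.reverse_complement(inplace=True)  # modifies seq, returns None
```

Renaming the copy, as in the last lines, is one simple way to avoid two records sharing the same name when both strands are written to the same file.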
- textize()#
Convert the digital sequence to a text sequence.
- Returns:
TextSequence – A copy of the sequence in text mode.
Added in version 0.1.4.
- translate(genetic_code=GeneticCode(1))#
Translate the sequence using the given genetic code.
- Parameters:
genetic_code (GeneticCode) – The genetic code to use for translating the sequence. If none is provided, the default uses the standard translation table (1) and expects DNA sequences.
- Returns:
DigitalSequence – The translation of the input sequence, in digital mode.
- Raises:
pyhmmer.errors.AlphabetMismatch – When the genetic_code expects a different nucleotide alphabet than the one currently in use to encode the sequence.
ValueError – When sequence could not be translated properly, because a codon could not be recognized, or because the sequence has an invalid length.
Note
The translation of a DNA/RNA codon supports ambiguous codons. If the amino acid is unambiguous, despite codon ambiguity, the correct amino acid is still determined: GGR translates as Gly, UUY as Phe, etc. If there is no single unambiguous amino acid translation, the codon is translated as X. Ambiguous amino acids (such as J or B) are never produced.
Added in version 0.7.2.
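The ambiguity rule in the note above can be sketched as follows: expand an ambiguous codon into every concrete codon it may stand for, translate each one, and keep the amino acid only if they all agree. This is a plain-Python illustration working on text rather than digital sequences; the helper names are hypothetical, and the table is the standard genetic code (translation table 1) written in the usual TCAG order.

```python
from itertools import product

# Standard genetic code (table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide codes expanded to the concrete bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def translate_codon(codon: str) -> str:
    """Translate one codon, returning X when the result is ambiguous."""
    try:
        choices = [IUPAC[base] for base in codon.upper()]
    except KeyError as err:
        raise ValueError(f"unrecognized nucleotide {err.args[0]!r}") from None
    aminos = {CODON_TABLE["".join(c)] for c in product(*choices)}
    return aminos.pop() if len(aminos) == 1 else "X"


def translate(sequence: str) -> str:
    """Translate a nucleotide string whose length is a multiple of 3."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return "".join(
        translate_codon(sequence[i:i + 3]) for i in range(0, len(sequence), 3)
    )


protein = translate("ATGGGRUUYRAY")
```

In this sketch GGR and UUY resolve to G and F, while RAY (which covers both Asn and Asp codons) becomes X rather than the ambiguous code B, matching the rule stated in the note.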
- sequence#
The raw sequence digits, as a byte vector.
Note
The internal ESL_SQ object allocates a buffer of size \(n+2\) (where \(n\) is the number of residues in the sequence), with the first and the last element of the buffer being sentinel values. This vector does not expose the sentinel values, only the \(n\) elements of the buffer in between.
Changed in version 0.4.0: Property is now a VectorU8 instead of a memoryview.
- Type:
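The buffer layout described in the note can be pictured with a small standard-library sketch. It is only an illustration: the sentinel value of 255 and the helper names are assumptions of this example, and a plain memoryview stands in for the vector type returned by the property.

```python
SENTINEL = 255  # assumed sentinel byte for this sketch


def make_buffer(digits: bytes) -> bytearray:
    """Build a buffer of size n + 2 with a sentinel at each end."""
    return bytearray([SENTINEL, *digits, SENTINEL])


def residues(buffer: bytearray) -> memoryview:
    """Expose only the n residues between the sentinels, without copying."""
    return memoryview(buffer)[1:-1]


buffer = make_buffer(bytes([0, 1, 2, 3]))
view = residues(buffer)

# The view shares memory with the buffer: writing through it changes the
# underlying residue, while the sentinels stay out of reach.
view[0] = 3
```

Slicing the view instead of the buffer itself is what keeps the access zero-copy: no bytes are duplicated, only the start and length of the exposed window change.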