Easel¶
High-level interface to the Easel C library.
Easel is a library developed by the Eddy/Rivas Lab to facilitate the development of biological software in C. It is used by HMMER and Infernal.
Data Structures¶
Bitfield¶
- class pyhmmer.easel.Bitfield¶
A statically sized sequence of booleans stored as a packed bitfield.
A bitfield is instantiated with an iterable, and each object will be tested for truth:
>>> bitfield = Bitfield([True, False, False]) >>> len(bitfield) 3 >>> bitfield[0] True >>> bitfield[1] False
Use
Bitfield.zeros
andBitfield.ones
to initialize a bitfield of a given length with all fields set to0
or1
:>>> Bitfield.zeros(4) pyhmmer.easel.Bitfield([False, False, False, False]) >>> Bitfield.ones(4) pyhmmer.easel.Bitfield([True, True, True, True])
Use indexing to access and edit individual bits:
>>> bitfield[0] = True >>> bitfield[0] True >>> bitfield[0] = False >>> bitfield[0] False
- __init__(iterable)¶
Create a new bitfield from an iterable of objects.
Objects yielded by the iterable can be of any type and will be tested for truth before setting the corresponding field.
- Raises
ValueError – When given an empty iterable.
- copy()¶
Return a copy of this bitfield object.
New in version 0.7.0.
- count(value=True)¶
Count the number occurrences of
value
in the bitfield.If no argument is given, counts the number of
True
occurences.Example
>>> bitfield = Bitfield.zeros(8) >>> bitfield.count(False) 8 >>> bitfield[0] = bitfield[1] = True >>> bitfield.count() 2
- toggle(index)¶
Switch the value of one single bit.
Example
>>> bitfield = Bitfield.zeros(8) >>> bitfield[0] False >>> bitfield.toggle(0) >>> bitfield[0] True >>> bitfield.toggle(0) >>> bitfield[0] False
KeyHash¶
- class pyhmmer.easel.KeyHash¶
A dynamically resized container to store byte keys using a hash table.
Internally uses Bob Jenkins’ one at a time hash, a simple and efficient hash function published in 1997 that exhibits avalanche behaviour.
Example
Add new keys to the key hash using the
add
method like you would with a Pythonset
:>>> kh = KeyHash() >>> kh.add(b"key") 0
Check if a key hash contains a given key:
>>> b"key" in kh True >>> b"missing" in kh False
Get the index associated with a key using the indexing notation:
>>> kh[b"key"] 0 >>> kh[b"missing"] Traceback (most recent call last): ... KeyError: b'missing'
Iterate over the keys of the key hash, in the order of insertion:
>>> kh.add(b"key2") 1 >>> for k in kh: ... print(k) b'key' b'key2'
See also
The Wikipedia article for Bob Jenkins’ hash functions: https://en.wikipedia.org/wiki/Jenkins_hash_function
- __init__()¶
Create a new empty key-hash collection.
- add(item)¶
Add a new key to the hash table, and return its index.
If
key
was already in the hash table, the previous index is returned:>>> kh = KeyHash() >>> kh.add(b"first") 0 >>> kh.add(b"second") 1 >>> kh.add(b"first") 0
- Parameters
key (
bytes
) – The key to add to the hash table.- Returns
int
– The index corresponding to the addedkey
.
New in version 0.3.0.
- clear()¶
Remove all entries from the collection.
- copy()¶
Create and return an exact copy of this mapping.
Example
>>> kh = KeyHash() >>> kh.add(b"key") 0 >>> copy = kh.copy() >>> b"key" in copy True
Sequences¶
Sequence¶
- class pyhmmer.easel.Sequence¶
An abstract biological sequence with some associated metadata.
Easel provides two different mode to store a sequence: text, or digital. In the HMMER code, changing from one mode to another mode is done in place, which allows recycling memory. However, doing so can be confusing since there is no way to know statically the representation of a sequence.
To avoid this,
pyhmmer
provides two subclasses of theSequence
abstract class to maintain the mode contract:TextSequence
andDigitalSequence
. Functions expecting sequences in digital format, likepyhmmer.hmmsearch
, can then use Python type system to make sure they receive sequences in the right mode. This allows type checkers such asmypy
to detect potential contract breaches at compile-time.- checksum()¶
Calculate a 32-bit checksum for the sequence.
- clear()¶
Reinitialize the sequence for re-use.
- copy()¶
Duplicate the sequence, and return the copy.
- write(fh)¶
Write the sequence alignement to a file handle, in FASTA format.
- Parameters
fh (
io.IOBase
) – A Python file handle, opened in binary mode.
New in version 0.3.0.
- residue_markups¶
Extra residue markups, mapping information to each position.
Keys and values are not decoded, since they are not necessarily valid UTF-8 bytestrings.
Caution
The values of the dictionary must be the same size as the sequence itself. Trying to set a residue markup of the wrong length will raise a
ValueError
:>>> seq = TextSequence(sequence="TTAATTGGT") >>> seq.residue_markups = {b"quality": b"efcfffffcfee"} Traceback (most recent call last): ... ValueError: Residue markup annotation has an invalid length (expected 9, got 12)
New in version 0.4.6.
- Type
TextSequence¶
- class pyhmmer.easel.TextSequence(Sequence)¶
A biological sequence stored in text mode.
Hint
Use the
sequence
property to access the sequence letters as a Python string.- __init__(name=None, description=None, accession=None, sequence=None, source=None, taxonomy_id=None)¶
Create a new text-mode sequence with the given attributes.
- copy()¶
Duplicate the text sequence, and return the copy.
- digitize(alphabet)¶
Convert the text sequence to a digital sequence using
alphabet
.- Returns
DigitalSequence
– A copy of the sequence in digital mode, digitized withalphabet
.
- reverse_complement(inplace=False)¶
Build the reverse complement of the sequence.
This method assumes that the sequence alphabet is IUPAC/DNA. If the sequence contains any unknown letters, they will be replaced by \(N\) in the reverse-complement.
- Parameters
inplace (
bool
) – Whether or not to copy the sequence before computing its reverse complement. WithFalse
(the default), the method will return a copy of the sequence that has been reverse-complemented. WithTrue
, it will reverse-complement inplace and returnNone
.- Raises
UserWarning – When the sequence contains unknown characters.
Example
>>> seq = TextSequence(sequence="ATGC") >>> seq.reverse_complement().sequence 'GCAT'
Caution
The copy made when
inplace
isFalse
is an exact copy, so thename
,description
andaccession
of the copy will be the same. This could lead to duplicates if you’re not careful!New in version 0.3.0.
DigitalSequence¶
- class pyhmmer.easel.DigitalSequence(Sequence)¶
A biological sequence stored in digital mode.
Hint
Use the
sequence
property to access the sequence digits as a memory view, allowing to access the individual bytes. This can be combined withnumpy.asarray
to get the sequence as an array with zero-copy.- __init__(alphabet, name=None, description=None, accession=None, sequence=None, source=None)¶
Create a new digital-mode sequence with the given attributes.
- Raises
ValueError – When
sequence
contains digits outside the alphabet symbol range.
New in version 0.1.4.
- copy()¶
Duplicate the digital sequence, and return the copy.
- reverse_complement(inplace=False)¶
Build the reverse complement of the sequence.
- Parameters
inplace (
bool
) – Whether or not to copy the sequence before computing its reverse complement. WithFalse
(the default), the method will return a copy of the sequence that has been reverse-complemented. WithTrue
, it will reverse-complement inplace and returnNone
.- Raises
ValueError – When the alphabet of the
DigitalSequence
does not have a complement mapping set (e.g.,Alphabet.amino
).
Caution
The copy made when
inplace
isFalse
is an exact copy, so thename
,description
andaccession
of the copy will be the same. This could lead to duplicates if you’re not careful!New in version 0.3.0.
- textize()¶
Convert the digital sequence to a text sequence.
- Returns
TextSequence
– A copy of the sequence in text-mode.
New in version 0.1.4.
- sequence¶
The raw sequence digits, as a byte vector.
Note
The internal
ESL_SQ
object allocates a buffer of size \(n+2\) (where \(n\) is the number of residues in the sequence), with the first and the last element of the buffer being sentinel values. This vector does not expose the sentinel values, only the \(n\) elements of the buffer in between.Changed in version 0.4.0: Property is now a
VectorU8
instead of a memoryview.- Type
Sequence Blocks¶
SequenceBlock¶
- class pyhmmer.easel.SequenceBlock¶
An abstract container for storing
Sequence
objects.To pass the target sequences efficiently in
Pipeline.search_hmm
, an array is allocated so that the inner loop can iterate over the target sequences without having to acquire the GIL for each new sequence (this gave a huge performance boost in v0.4.5). However, there was no way to reuse this between different queries; some memory recycling was done, but the target sequences had to be indexed for every query. This class allows synchronizing a Pythonlist
ofSequence
objects with an internal C-contiguous buffer of pointers toESL_SQ
structs that can be used in the HMMER search loop.New in version 0.7.0.
- clear()¶
Remove all sequences from the block.
- copy()¶
Return a copy of the sequence block.
Note
The sequence internally refered to by this collection are not copied. Use
copy.deepcopy
if you also want to duplicate the internal storage of each sequence.
- extend(iterable, /)¶
Extend block by appending sequences from the iterable.
- largest()¶
Return the largest sequence in the block.
TextSequenceBlock¶
- class pyhmmer.easel.TextSequenceBlock(SequenceBlock)¶
A container for storing
TextSequence
objects.New in version 0.7.0.
- __init__(*args, **kwargs)¶
- append(sequence)¶
Append
sequence
at the end of the block.
- copy()¶
Return a copy of the text sequence block.
Note
The sequence internally refered to by this collection are not copied. Use
copy.deepcopy
is you also want to duplicate the internal storage of each sequence.
- digitize(alphabet)¶
Create a block containing sequences from this block in digital mode.
- index(sequence, start=0, stop=9223372036854775807)¶
Return the index of the first occurence of
sequence
.- Raises
ValueError – When the block does not contain
sequence
.
- insert(index, sequence)¶
Insert a new sequence in the block before
index
.
- largest()¶
Return the largest sequence in the block.
- pop(index=- 1)¶
Remove and return a sequence from the block (the last one by default).
- remove(sequence)¶
Remove the first occurence of the given sequence.
DigitalSequenceBlock¶
- class pyhmmer.easel.DigitalSequenceBlock(SequenceBlock)¶
A container for storing
DigitalSequence
objects.- alphabet¶
The biological alphabet shared by all sequences in the collection.
- Type
Alphabet
, readonly
New in version 0.7.0.
- __init__()¶
Create a new digital sequence block with the given alphabet.
- Parameters
alphabet (
Alphabet
) – The alphabet to use for all the sequences in the block.iterable (iterable of
DigitalSequence
) – An initial collection of digital sequences to add to the block.
- Raises
AlphabetMismatch – When the alphabet of one of the sequences does not match
alphabet
.
- append(sequence)¶
Append
sequence
at the end of the block.
- copy()¶
Return a copy of the digital sequence block.
Note
The sequence internally refered to by this collection are not copied. Use
copy.deepcopy
is you also want to duplicate the internal storage of each sequence.
- index(sequence, start=0, stop=9223372036854775807)¶
Return the index of the first occurence of
sequence
.- Raises
ValueError – When the block does not contain
sequence
.
- insert(index, sequence)¶
Insert a new sequence in the block before
index
.
- largest()¶
Return the largest sequence in the block.
- pop(index=- 1)¶
Remove and return a sequence from the block (the last one by default).
- remove(sequence)¶
Remove the first occurence of the given sequence.
- textize(alphabet)¶
Create a block containing sequences from this block in text mode.
SequenceFile¶
SequenceFile¶
- class pyhmmer.easel.SequenceFile¶
A wrapper around a sequence file, containing unaligned sequences.
This class supports reading sequences stored in different formats, such as FASTA, GenBank or EMBL. The format of each file can be automatically detected, but it is also possible to pass an explicit format specifier when the
SequenceFile
is instantiated.Hint
SequenceFile
objects can also be used to parse files containing multiple sequence alignments: in that case, the sequences will be read sequentially, removing the gap characters:>>> with SequenceFile("tests/data/msa/LuxC.sto") as sf: ... sequences = sf.read_block() >>> print(sequences[0].name[:6], sequences[0].sequence[:30]) b'Q9KV99' LANQPLEAILGLINEARKSWSSTPELDPYR >>> print(sequences[1].name[:6], sequences[1].sequence[:30]) b'Q2WLE3' IYSYPSEAMIEIINEYSKILCSDRKFLSYE
New in version 0.2.0: The
alphabet
attribute.Changed in version 0.4.8: Support reading sequences from a file-like handle. Support reading individual sequences from an MSA file.
- __init__(file, format=None, *, ignore_gaps=False, digital=True, alphabet=None)¶
Create a new sequence file parser wrapping the given
file
.- Parameters
file (
str
or file-like object) – Either the path to a file containing the sequences to read, or a file-like object opened in binary mode.format (
str
, optional) – The format of the file, orNone
to autodetect. Supported values are:fasta
,embl
,genbank
,ddbj
,uniprot
,ncbi
,daemon
,hmmpgmd
,fmindex
, plus any format also supported byMSAFile
.
- Keyword Arguments
ignore_gaps (
bool
) – When set toTrue
, allow ignoring gap characters (‘-’) when they are present in ungapped formats such asfasta
. WithFalse
, stick to the default Easel behaviour.digital (
bool
) – Whether to read the sequences in text or digital mode. This will affect the type ofSequence
objects returned later by theread
function.alphabet (
Alphabet
) – The alphabet to use to digitize the sequences while reading. IfNone
given, it will be guessed based on the contents of the first sequence.
- Raises
ValueError – When
format
is not a valid sequence format.OSError – If an internal parser error occurred while guessing the alphabet or the format.
Caution
SequenceFile
can generally read sequences from binary-mode file-like objects, except for sequences in an NCBI BLAST database, since it is composed of multiple files. Reading from an NCBI BLAST database passed from a filename is however supported.Changed in version 0.4.4: Added the
ignore_gaps
parameter.Changed in version 0.4.8: Support reading from a file-like object (except NCBI format).
Changed in version 0.5.0: Added the
digital
andalphabet
keyword arguments.Deprecated since version 0.6.0: The
ignore_gaps
keyword argument, useafa
format instead.
- close()¶
Close the file and free the resources used by the parser.
- guess_alphabet()¶
Guess the alphabet of an open
SequenceFile
.This method tries to guess the alphabet of a sequence file by inspecting the first sequence in the file. It returns the alphabet, or
None
if the file alphabet cannot be reliably guessed.- Raises
EOFError – if the file is empty.
OSError – if a parse error occurred.
ValueError – if this methods is called on a closed file.
Example
>>> with SequenceFile("tests/data/seqs/bmyD.fna") as sf: ... sf.guess_alphabet() pyhmmer.easel.Alphabet.dna() >>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf: ... sf.guess_alphabet() pyhmmer.easel.Alphabet.amino()
New in version 0.6.3.
- parse(buffer, format)¶
Parse a sequence from a binary
buffer
using the givenformat
.- Argument:
- buffer (
bytes
or byte-like buffer): A buffer containing the sequence data to parse. Any type implementing the buffer protocol (such as
bytes
,bytearray
, ormemoryview
) is supported.- format (
str
): The format of the sequence data. See the SequenceFile.__init__
documentation for allowed values.
- buffer (
- Keyword Arguments
alphabet (
Alphabet
) – The alphabet to use to digitize the returned sequence, if desired.- Returns
Sequence
– The sequenced parsed from the buffer, either as aDigitalSequence
if an alphabet was provided, or as aTextSequence
ifNone
was given.- Raises
ValueError – When
format
is not a valid sequence format.OSError – If an internal parser error occurred while guessing the alphabet or the format.
- parseinto(seq, buffer, format)¶
Parse a sequence from a binary
buffer
intoseq
.- Argument:
- seq (
Sequence
): The sequence object into which the deseriazlied sequence data will be written.
- buffer (
bytes
or byte-like buffer): A buffer containing the sequence data to parse. Any type implementing the buffer protocol (such as
bytes
,bytearray
, ormemoryview
) is supported.- format (
str
): The format of the sequence data. See the SequenceFile.__init__
documentation for allowed values.
- seq (
- Raises
ValueError – When
format
is not a valid sequence format.OSError – If an internal parser error occurred while guessing the alphabet or the format.
- Returns
Sequence
– The sequence given as argument, orNone
if the end of the file was reached.
- read(skip_info=False, skip_sequence=False)¶
Read the next sequence from the file.
- Parameters
- Returns
Sequence
– The next sequence in the file, orNone
if all sequences were read from the file.- Raises
ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Hint
This method allocates a new sequence, which is not efficient in case the sequences are being read within a tight loop. Use
SequenceFile.readinto
with an already initializedSequence
if you can to recycle the internal buffers.
- read_block(max_sequences=None, max_residues=None)¶
Read several sequences into a sequence block.
- Parameters
sequences (
int
, optional) – The maximum number of sequences to read before returning a block. Leave asNone
to read all remaining sequences from the file.residues (
int
, optional) – The number of residues to read before returning the block. Leave asNone
to keep reading sequences without a residue limit.
- Returns
SequenceBlock
– A sequence block object, which may be empty if there are no sequences to read anymore. The concrete type depends on whether theSequenceFile
was opened in text or digital mode.- Raises
ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Example
Read a block of at most 4 sequences from a sequence file:
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf: ... block = sf.read_block(sequences=4) >>> len(block) 4
Read sequences until the block contains at least 1000 residues:
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf: ... block = sf.read_block(residues=1000) >>> len(block) 3 >>> len(block[0]) + len(block[1]) + len(block[2]) 1444
Note that the last sequence will not be truncated, so the block will always contain more than
max_residues
unless the end of the file was reached.New in version 0.7.0.
- readinto(seq, skip_info=False, skip_sequence=False)¶
Read the next sequence from the file, using
seq
to store data.- Parameters
seq (
Sequence
) – A sequence object to use to store the next entry in the file. If this sequence was used before, it must be properly reset (using theSequence.clear
method) before using it again withreadinto
.skip_info (
bool
) – PassTrue
to disable reading the sequence metadata, and only read the sequence letters. Defaults toFalse
.skip_sequence (
bool
) – PassTrue
to disable reading the sequence letters, and only read the sequence metadata. Defaults toFalse
.
- Returns
Sequence
– A reference toseq
that was passed as an input, orNone
if no sequences are left in the file.- Raises
ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Example
Use
SequenceFile.readinto
to loop over the sequences in a file while recycling the sameSequence
buffer:>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf: ... seq = TextSequence() ... while sf.readinto(seq) is not None: ... # ... process seq here ... # ... seq.clear()
- rewind()¶
Rewind the file back to the beginning.
For sequential formats, this method is supported for both path-based and file object-based sequence files. For multiple-sequence alignment formats, the underlying
MSAFile
needs to be reopened, so this is only supported for path-based files.- Raises
io.UnsupportedOperation – When attempting to rewind a sequence file where the underlying stream is a file-like object that does not support the
seek
method.
- closed¶
Whether the
SequenceFile
is closed or not.- Type
- digital¶
Whether the
SequenceFile
is in digital mode or not.New in version 0.5.0.
- Type
- format¶
The format of the
SequenceFile
.- Type
Alignments¶
MSA¶
- class pyhmmer.easel.MSA¶
An abstract alignment of multiple sequences.
Hint
Use
len(msa)
to get the number of columns in the alignment, andlen(msa.sequences)
to get the number of sequences (i.e. the number of rows).- checksum()¶
Calculate a 32-bit checksum for the multiple sequence alignment.
- write(fh, format)¶
Write the multiple sequence alignement to a file handle.
- Parameters
New in version 0.3.0.
- names¶
The name of each sequence in the alignment.
Every sequence in the alignment is required to have a name, so no member of the
tuple
will ever beNone
.Example
>>> s1 = TextSequence(name=b"seq1", sequence="ATGC") >>> s2 = TextSequence(name=b"seq2", sequence="ATGC") >>> msa = TextMSA(name=b"msa", sequences=[s1, s2]) >>> msa.names (b'seq1', b'seq2')
New in version 0.4.8.
TextMSA¶
- class pyhmmer.easel.TextMSA(MSA)¶
A multiple sequence alignement stored in text mode.
- __init__(name=None, description=None, accession=None, sequences=None, author=None)¶
Create a new text-mode alignment with the given
sequences
.- Parameters
name (
bytes
, optional) – The name of the alignment, if any.description (
bytes
, optional) – The description of the alignment, if any.accession (
bytes
, optional) – The accession of the alignment, if any.sequences (collection of
TextSequence
) – The sequences to store in the multiple sequence alignment. All sequences must have the same length. They also need to have distinct names.author (
bytes
, optional) – The author of the alignment, often used to record the aligner it was created with.
- Raises
ValueError – When the alignment cannot be created from the given sequences.
TypeError – When
sequences
is not an iterable ofTextSequence
objects.
Example
>>> s1 = TextSequence(name=b"seq1", sequence="ATGC") >>> s2 = TextSequence(name=b"seq2", sequence="ATGC") >>> msa = TextMSA(name=b"msa", sequences=[s1, s2]) >>> len(msa) 4
Changed in version 0.3.0: Allow creating an alignment from an iterable of
TextSequence
.
- copy()¶
Duplicate the text sequence alignment, and return the copy.
- digitize(alphabet)¶
Convert the text alignment to a digital alignment using
alphabet
.- Returns
DigitalMSA
– An alignment in digital mode containing the same sequences digitized withalphabet
.- Raises
ValueError – When the text sequence contains invalid characters that cannot be converted according to
alphabet.symbols
.
- alignment¶
A view of the aligned sequences as strings.
This property gives access to the aligned sequences, including gap characters, so that they can be displayed or processed column by column.
Examples
Use
TextMSA.alignment
to display an alignment in text format:>>> for name, aligned in zip(luxc.names, luxc.alignment): ... print(name, " ", aligned[:40], "...") b'Q9KV99.1' LANQPLEAILGLINEARKSWSST------------PELDP ... b'Q2WLE3.1' IYSYPSEAMIEIINEYSKILCSD------------RKFLS ... b'Q97GS8.1' VHDIKTEETIDLLDRCAKLWLDDNYSKK--HIETLAQITN ... b'Q3WCI9.1' LLNVPLKEIIDFLVETGERIRDPRNTFMQDCIDRMAGTHV ... b'P08639.1' LNDLNINNIINFLYTTGQRWKSEEYSRRRAYIRSLITYLG ... ...
Use the splat operator (*) in combination with the
zip
builtin to iterate over the columns of an alignment:>>> for idx, col in enumerate(zip(*luxc.alignment)): ... print(idx+1, col) 1 ('L', 'I', 'V', 'L', 'L', ...) 2 ('A', 'Y', 'H', 'L', 'N', ...) ...
New in version 0.4.8.
- sequences¶
A view of the sequences in the alignment.
This property lets you access the individual sequences in the multiple sequence alignment as
TextSequence
instances.Example
Query the number of sequences in the alignment with
len
, or access individual members via indexing notation:>>> s1 = TextSequence(name=b"seq1", sequence="ATGC") >>> s2 = TextSequence(name=b"seq2", sequence="ATGC") >>> msa = TextMSA(name=b"msa", sequences=[s1, s2]) >>> len(msa.sequences) 2 >>> msa.sequences[0].name b'seq1'
Caution
Sequences in the list are copies, so editing their attributes will have no effect on the alignment:
>>> msa.sequences[0].name b'seq1' >>> msa.sequences[0].name = b"seq1bis" >>> msa.sequences[0].name b'seq1'
Support for this feature may be added in a future version, but can be circumvented for now by forcingly setting the updated version of the object:
>>> seq = msa.sequences[0] >>> seq.name = b"seq1bis" >>> msa.sequences[0] = seq >>> msa.sequences[0].name b'seq1bis'
New in version 0.3.0.
- Type
_TextMSASequences
DigitalMSA¶
- class pyhmmer.easel.DigitalMSA(MSA)¶
A multiple sequence alignment stored in digital mode.
- __init__(alphabet, name=None, description=None, accession=None, sequences=None, author=None)¶
Create a new digital-mode alignment with the given
sequences
.- Parameters
alphabet (
Alphabet
) – The alphabet of the alignmed sequences.name (
bytes
, optional) – The name of the alignment, if any.description (
bytes
, optional) – The description of the alignment, if any.accession (
bytes
, optional) – The accession of the alignment, if any.sequences (iterable of
DigitalSequence
) – The sequences to store in the multiple sequence alignment. All sequences must have the same length and alphabet. They also need to have distinct names set.author (
bytes
, optional) – The author of the alignment, often used to record the aligner it was created with.
Changed in version 0.3.0: Allow creating an alignment from an iterable of
DigitalSequence
.
- copy()¶
Duplicate the digital sequence alignment, and return the copy.
- textize()¶
Convert the digital alignment to a text alignment.
- Returns
TextMSA
– A copy of the alignment in text-mode.
New in version 0.3.0.
- sequences¶
A view of the sequences in the alignment.
This property lets you access the individual sequences in the multiple sequence alignment as
DigitalSequence
instances.See also
The documentation for the
TextMSA.sequences
property, which contains some additional information.New in version 0.3.0.
- Type
_DigitalMSASequences
MSAFile¶
- class pyhmmer.easel.MSAFile¶
A wrapper around a multiple-alignment file.
This class supports reading sequences stored in different formats, such as Stockholm, A2M, PSI-BLAST or Clustal.
- name¶
The name of the MSA file, if it was created from a filename, or
None
if it wraps a file-like object.- Type
str
, optional
Hint
Some Clustal files created by alignment tools other than Clustal (such as MUSCLE or MAFFT, for instance), may not contain the header expected by Easel for the Clustal format. If you get an error while trying to parse these files, use the
"clustallike"
format instead of the"clustal"
format when creating theMSAFile
.- __init__(file, format=None, *, digital=False, alphabet=None)¶
Create a new MSA file parser wrapping the given
file
.- Parameters
file (
str
or file-like object) – Either the path to a file containing the sequences to read, or a file-like object opened in binary mode.format (
str
, optional) – The format of the file, orNone
to autodetect. Supported values are:stockholm
,pfam
,a2m
,psiblast
,selex
,afa
(aligned FASTA),clustal
,clustallike
,phylip
,phylips
.
- Keyword Arguments
digital (
bool
) – Whether to read the sequences in text or digital mode. This will affect the type ofMSA
objects returned later by theread
function.alphabet (
Alphabet
) – The alphabet to use to digitize the sequences while reading. IfNone
given, it will be guessed based on the contents of the first sequence.
- Raises
ValueError – When
format
is not a valid MSA format.
Changed in version 0.4.8: Support reading from a file-like object.
Changed in version 0.5.0: Added the
digital
andalphabet
keyword arguments.
- close()¶
Close the file and free the resources used by the parser.
- guess_alphabet()¶
Guess the alphabet of an open
MSAFile
.This method tries to guess the alphabet of a multiple-alignment file by inspecting the first entry in the file. It returns the alphabet, or
None
if the file alphabet cannot be reliably guessed.- Raises
EOFError – if the file is empty.
OSError – if a parse error occurred.
ValueError – if this methods is called after the file was closed.
Example
>>> with MSAFile("tests/data/msa/laccase.clw") as mf: ... mf.guess_alphabet() pyhmmer.easel.Alphabet.amino()
- read()¶
Read the next alignment from the file.
- Returns
MSA
– The next alignment in the file, orNone
if all the alignments were read from the file already.- Raises
ValueError – When attempting to read an alignment from a closed file, or when the file could not be parsed.
Linear Algebra¶
Vector¶
- class pyhmmer.easel.Vector¶
An abstract 1D array of fixed size.
New in version 0.4.0.
- argmax()¶
Return index of the maximum element in the vector.
- Raises
ValueError – When called on an empty vector.
- argmin()¶
Return index of the minimum element in the vector.
- Raises
ValueError – When called on an empty vector.
- copy()¶
Create a copy of the vector, allocating a new buffer.
- max()¶
Return value of the maximum element in the vector.
- Raises
ValueError – When called on an empty vector.
- min()¶
Return value of the minimum element in the vector.
- Raises
ValueError – When called on an empty vector.
- reverse()¶
Reverse the vector, in place.
- sum()¶
Returns the scalar sum of all elements in the vector.
- zeros(n)¶
Create a vector of size
n
filled with zeros.
VectorF¶
- class pyhmmer.easel.VectorF¶
A vector storing single-precision floating point numbers.
Individual elements of a vector can be accessed and modified with the usual indexing notation:
>>> v = VectorF([1.0, 2.0, 3.0]) >>> v[0] 1.0 >>> v[-1] 3.0 >>> v[0] = v[-1] = 4.0 >>> v pyhmmer.easel.VectorF([4.0, 2.0, 4.0])
Slices are also supported, and they do not copy data (use the
copy
method to allocate a new vector):>>> v = VectorF(range(6)) >>> v[2:5] pyhmmer.easel.VectorF([2.0, 3.0, 4.0]) >>> v[2:-1] = 10.0 >>> v pyhmmer.easel.VectorF([0.0, 1.0, 10.0, 10.0, 10.0, 5.0])
Addition and multiplication is supported for scalars, in place or not:
>>> v = VectorF([1.0, 2.0, 3.0]) >>> v += 1 >>> v pyhmmer.easel.VectorF([2.0, 3.0, 4.0]) >>> v * 3 pyhmmer.easel.VectorF([6.0, 9.0, 12.0])
Pairwise operations can also be performed, but only on vectors of the same dimension and precision:
>>> v = VectorF([1.0, 2.0, 3.0]) >>> v * v pyhmmer.easel.VectorF([1.0, 4.0, 9.0]) >>> v += VectorF([3.0, 4.0, 5.0]) >>> v pyhmmer.easel.VectorF([4.0, 6.0, 8.0]) >>> v *= VectorF([1.0]) Traceback (most recent call last): ... ValueError: cannot pairwise multiply vectors of different sizes
Objects of this type support the buffer protocol, and can be viewed as a
numpy.ndarray
of one dimension using thenumpy.asarray
function, and can be passed without copy to mostnumpy
functions:>>> v = VectorF([1.0, 2.0, 3.0]) >>> numpy.asarray(v) array([1., 2., 3.], dtype=float32) >>> numpy.log2(v) array([0. , 1. , 1.5849625], dtype=float32)
New in version 0.4.0.
- __init__(*args, **kwargs)¶
- argmax()¶
Return index of the maximum element in the vector.
- Raises
ValueError – When called on an empty vector.
- argmin()¶
Return index of the minimum element in the vector.
- Raises
ValueError – When called on an empty vector.
- copy()¶
Create a copy of the vector, allocating a new buffer.
- entropy()¶
Compute the Shannon entropy of the vector.
The Shannon entropy of a probability vector is defined as:
\[H = \sum_{i=0}^{N}{\log_2 p_i}\]Example
>>> easel.VectorF([0.1, 0.1, 0.3, 0.5]).entropy() 1.6854... >>> easel.VectorF([0.25, 0.25, 0.25, 0.25]).entropy() 2.0
References
Cover, Thomas M., and Thomas, Joy A. Entropy, Relative Entropy, and Mutual Information. In Elements of Information Theory, 13–55. Wiley (2005): 2. doi:10.1002/047174882X.ch2 ISBN:9780471241959.
New in version 0.4.10.
- max()¶
Return value of the maximum element in the vector.
- Raises
ValueError – When called on an empty vector.
- min()¶
Return value of the minimum element in the vector.
- Raises
ValueError – When called on an empty vector.
- normalize()¶
Normalize a vector so that all elements sum to 1.
Caution
If sum is zero, sets all elements to \(\frac{1}{n}\), where \(n\) is the size of the vector.
- relative_entropy(other)¶
Compute the relative entropy between two probability vectors.
The Shannon relative entropy of two probability vectors \(p\) and \(q\), also known as the Kullback-Leibler divergence, is defined as:
\[D(p \parallel q) = \sum_i p_i \log_2 \frac{p_i}{q_i}.\]with \(D(p \parallel q) = \infty\) per definition if \(q_i = 0\) and \(p_i > 0\) for any \(i\).
Example
>>> v1 = easel.VectorF([0.1, 0.1, 0.3, 0.5]) >>> v2 = easel.VectorF([0.25, 0.25, 0.25, 0.25]) >>> v1.relative_entropy(v2) 0.3145... >>> v2.relative_entropy(v1) # this is no symmetric relation 0.3452...
References
Cover, Thomas M., and Thomas, Joy A. Entropy, Relative Entropy, and Mutual Information. In Elements of Information Theory, 13–55. Wiley (2005): 2. doi:10.1002/047174882X.ch2 ISBN:9780471241959.
New in version 0.4.10.
- reverse()¶
Reverse the vector, in place.
- sum()¶
Returns the scalar sum of all elements in the vector.
Float summations use Kahan’s algorithm, in order to minimize roundoff error accumulation. Additionally, they are most accurate if the vector is sorted in increasing order, from small to large, so you may consider sorting the vector before summing it.
References
Kahan, W. Pracniques: Further Remarks on Reducing Truncation Errors. Communications of the ACM 8, no. 1 (1 January 1965): 40. doi:10.1145/363707.363723.
- format¶
- itemsize¶
VectorU8¶
- class pyhmmer.easel.VectorU8¶
A vector storing byte-sized unsigned integers.
New in version 0.4.0.
- __init__(*args, **kwargs)¶
- argmax()¶
Return index of the maximum element in the vector.
- Raises
ValueError – When called on an empty vector.
- argmin()¶
Return index of the minimum element in the vector.
- Raises
ValueError – When called on an empty vector.
- copy()¶
Create a copy of the vector, allocating a new buffer.
- max()¶
Return value of the maximum element in the vector.
- Raises
ValueError – When called on an empty vector.
- min()¶
Return value of the minimum element in the vector.
- Raises
ValueError – When called on an empty vector.
- reverse()¶
Reverse the vector, in place.
- sum()¶
Returns the scalar sum of all elements in the vector.
Caution
The sum is wrapping:
>>> vec = VectorU8([255, 2]) >>> vec.sum() 1
- format¶
- itemsize¶
Matrix¶
- class pyhmmer.easel.Matrix¶
An abstract 2D array of fixed size.
New in version 0.4.0.
- argmax()¶
Return the coordinates of the maximum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- argmin()¶
Return the coordinates of the minimum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- copy()¶
Create a copy of the matrix, allocating a new buffer.
- max()¶
Return the value of the maximum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- min()¶
Return the value of the minimum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- sum()¶
Return the sum of all elements in the matrix.
- zeros(m, n)¶
Create a new \(m \times n\) matrix filled with zeros.
- format¶
The format of each item in the matrix.
See also
The
array
module of the Python standard library for a detail about available type codes.New in version 0.4.7.
- Type
- shape¶
The shape of the matrix.
Example
>>> m = MatrixF([ [1.0, 2.0], [3.0, 4.0], [5.0, 6.0] ]) >>> m.shape (3, 2)
- Type
MatrixF¶
- class pyhmmer.easel.MatrixF¶
A matrix storing single-precision floating point numbers.
Use indexing notation to access and edit individual elements of the matrix:
>>> m = MatrixF.zeros(2, 2) >>> m[0, 0] = 3.0 >>> m pyhmmer.easel.MatrixF([[3.0, 0.0], [0.0, 0.0]])
Indexing can also be performed at the row-level to get a
VectorF
without copying the underlying data:>>> m = MatrixF([ [1.0, 2.0], [3.0, 4.0] ]) >>> m[0] pyhmmer.easel.VectorF([1.0, 2.0])
Objects of this type support the buffer protocol, and can be viewed as a
numpy.ndarray
with two dimensions using thenumpy.asarray
function, and can be passed without copy to mostnumpy
functions:>>> m = MatrixF([ [1.0, 2.0], [3.0, 4.0] ]) >>> numpy.asarray(m) array([[1., 2.], [3., 4.]], dtype=float32) >>> numpy.log2(m) array([[0. , 1. ], [1.5849625, 2. ]], dtype=float32)
New in version 0.4.0.
- __init__(*args, **kwargs)¶
- argmax()¶
Return the coordinates of the maximum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- argmin()¶
Return the coordinates of the minimum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- copy()¶
Create a copy of the matrix, allocating a new buffer.
- max()¶
Return the value of the maximum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- min()¶
Return the value of the minimum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- sum()¶
Return the sum of all elements in the matrix.
- format¶
- itemsize¶
MatrixU8¶
- class pyhmmer.easel.MatrixU8¶
A matrix storing byte-sized unsigned integers.
New in version 0.4.0.
- __init__(*args, **kwargs)¶
- argmax()¶
Return the coordinates of the maximum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- argmin()¶
Return the coordinates of the minimum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- copy()¶
Create a copy of the matrix, allocating a new buffer.
- max()¶
Return the value of the maximum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- min()¶
Return the value of the minimum element in the matrix.
- Raises
ValueError – When called on an empty matrix.
- sum()¶
Return the sum of all elements in the matrix.
- format¶
- itemsize¶
Miscellaneous¶
Alphabet¶
- class pyhmmer.easel.Alphabet¶
A biological alphabet, including additional marker symbols.
This type is used to share an alphabet to several objects in the
easel
andplan7
modules. Reference counting helps sharing the same instance everywhere, instead of reallocating memory every time an alphabet is needed.Use the factory class methods to obtain a default
Alphabet
for one of the three standard biological alphabets:>>> dna = Alphabet.dna() >>> rna = Alphabet.rna() >>> aa = Alphabet.amino()
- amino()¶
Create a default amino-acid alphabet.
- decode(sequence)¶
Decode a digital sequence into its textual representation.
- Parameters
sequence (
VectorU8
) – A raw sequence in digital format.- Returns
str
– A raw sequence in textual format.
Example
>>> alphabet = easel.Alphabet.amino() >>> dseq = easel.VectorU8([0, 4, 2, 17, 3, 13, 0, 0, 5]) >>> alphabet.decode(dseq) 'AFDVEQAAG'
New in version 0.6.3.
- dna()¶
Create a default DNA alphabet.
- encode(sequence)¶
Encode a text sequence into its digital representation.
- Parameters
sequence (
str
) – A raw sequence in text format.- Returns
VectorU8
– A raw sequence in digital format.
Example
>>> alphabet = easel.Alphabet.dna() >>> alphabet.encode("ACGT") pyhmmer.easel.VectorU8([0, 1, 2, 3])
New in version 0.6.3.
- rna()¶
Create a default RNA alphabet.
- K¶
The alphabet size, counting only actual alphabet symbols.
Example
>>> Alphabet.dna().K 4 >>> Alphabet.amino().K 20
- Type
- Kp¶
The complete alphabet size, including marker symbols.
Example
>>> Alphabet.dna().Kp 18 >>> Alphabet.amino().Kp 29
- Type
Randomness¶
- class pyhmmer.easel.Randomness¶
A portable, thread-safe random number generator.
Methods with an implementation in Easel are named after the equivalent methods of
random.Random
.New in version 0.4.2.
- __init__(seed=None, fast=False)¶
Create a new random number generator with the given seed.
- Parameters
seed (
int
) – The seed to initialize the generator with. If0
orNone
is given, an arbitrary seed will be chosen using the system clock.fast (
bool
) – IfTrue
, use a linear congruential generator (LCG), which is low quality and should only be used for integration with legacy code. WithFalse
, use the Mersenne Twister MT19937 algorithm instead.
- copy()¶
Return a copy of the random number generator in the same exact state.
- getstate()¶
Get a tuple containing the current state.
- normalvariate(mu, sigma)¶
Generate a Gaussian-distributed sample.
- random()¶
Generate a uniform random deviate on \(\left[ 0, 1 \right)\).
- seed()¶
Reinitialize the random number generator with the given seed.
- setstate(state)¶
Restores the state of the random number generator.
Sequence / Subsequence Index¶
- class pyhmmer.easel.SSIReader¶
A read-only handler for sequence/subsequence index file.
- class Entry(fd, record_offset, data_offset, record_length)¶
- property data_offset¶
Alias for field number 2
- property fd¶
Alias for field number 0
- property record_length¶
Alias for field number 3
- property record_offset¶
Alias for field number 1
- class FileInfo(name, format)¶
- property format¶
Alias for field number 1
- property name¶
Alias for field number 0
- __init__(file)¶
Create a new SSI file reader for the file at the given location.
- Parameters
file (
str
) – The path to a sequence/subsequence index file to read.
- close()¶
Close the SSI file reader.
- class pyhmmer.easel.SSIWriter¶
A writer for sequence/subsequence index files.
- __init__(file)¶
Create a new SSI file write for the file at the given location.
- Parameters
- Raises
FileNotFoundError – When the path to the file cannot be resolved.
FileExistsError – When the file exists and
exclusive
isTrue
.
- add_alias(alias, key)¶
Make
alias
an alias ofkey
in the index.
- add_file(filename, format=0)¶
Add a new file to the index.
- add_key(key, fd, record_offset, data_offset=0, record_length=0)¶
Add a new entry to the index with the given
key
.
- close()¶
Close the SSI file writer.