Easel
High-level interface to the Easel C library.
Easel is a library developed by the Eddy/Rivas Lab to facilitate the development of biological software in C. It is used by HMMER and Infernal.
Alphabet
-
class
pyhmmer.easel.
Alphabet
¶ A biological alphabet, including additional marker symbols.
This type is used to share an alphabet between several objects in the easel and plan7 modules. Reference counting makes it possible to share the same instance everywhere, instead of reallocating memory every time an alphabet is needed.
Use the factory class methods to obtain a default Alphabet for one of the three standard biological alphabets:
>>> dna = Alphabet.dna()
>>> rna = Alphabet.rna()
>>> aa = Alphabet.amino()
-
amino
()¶ Create a default amino-acid alphabet.
-
dna
()¶ Create a default DNA alphabet.
-
rna
()¶ Create a default RNA alphabet.
-
K
¶ The alphabet size, counting only actual alphabet symbols.
Example
>>> Alphabet.dna().K
4
>>> Alphabet.amino().K
20
- Type
int
-
Kp
¶ The complete alphabet size, including marker symbols.
Example
>>> Alphabet.dna().Kp
18
>>> Alphabet.amino().Kp
29
- Type
int
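The difference between K and Kp can be illustrated with a minimal pure-Python sketch. It assumes the symbol ordering of Easel's default alphabets (canonical residues first, then the gap character, the degenerate codes, and finally the special marker symbols). The constant and function names are hypothetical and are not part of pyhmmer.

```python
# Symbol strings assumed to mirror Easel's default alphabets: canonical
# residues, then the gap, then degenerate codes, then special markers.
DNA_SYMBOLS = "ACGT" + "-" + "RYMKSWHBVDN" + "*~"
AMINO_SYMBOLS = "ACDEFGHIKLMNPQRSTVWY" + "-" + "BJZOUX" + "*~"


def alphabet_sizes(symbols: str, canonical: int) -> tuple:
    """Return (K, Kp) for a symbol string holding `canonical` real residues."""
    return canonical, len(symbols)


dna_k, dna_kp = alphabet_sizes(DNA_SYMBOLS, 4)
aa_k, aa_kp = alphabet_sizes(AMINO_SYMBOLS, 20)
print(dna_k, dna_kp)  # 4 18
print(aa_k, aa_kp)    # 20 29
```

Under these assumptions the sizes agree with the values reported by the K and Kp examples above.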
-
Bitfield¶
-
class
pyhmmer.easel.
Bitfield
¶ A statically sized sequence of booleans stored as a packed bitfield.
A bitfield is instantiated with a fixed length, and all booleans are set to False by default:
>>> bitfield = Bitfield(8)
>>> len(bitfield)
8
>>> bitfield[0]
False
Use indexing to access and edit individual bits:
>>> bitfield[0] = True
>>> bitfield[0]
True
>>> bitfield[0] = False
>>> bitfield[0]
False
-
__init__
(length)¶ Create a new bitfield with the given length.
-
count
(value=True)¶ Count the number of occurrences of value in the bitfield.
If no argument is given, counts the number of True occurrences.
Example
>>> bitfield = Bitfield(8)
>>> bitfield.count(False)
8
>>> bitfield[0] = bitfield[1] = True
>>> bitfield.count()
2
-
toggle
(index)¶ Switch the value of one single bit.
Example
>>> bitfield = Bitfield(8)
>>> bitfield[0]
False
>>> bitfield.toggle(0)
>>> bitfield[0]
True
>>> bitfield.toggle(0)
>>> bitfield[0]
False
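To show what "packed" means in practice, the following is a rough pure-Python analogue that stores eight booleans per byte. The PackedBits class is hypothetical and only mirrors the behaviour documented for Bitfield; the actual memory layout used by Easel may differ.

```python
class PackedBits:
    """Hypothetical pure-Python analogue of a packed, fixed-size bitfield."""

    def __init__(self, length: int) -> None:
        self._length = length
        # One byte stores eight booleans; every bit starts cleared (False).
        self._data = bytearray((length + 7) // 8)

    def __len__(self) -> int:
        return self._length

    def _locate(self, index: int) -> tuple:
        # Map a bit index to (byte offset, bit mask), with bounds checking.
        if not 0 <= index < self._length:
            raise IndexError("bitfield index out of range")
        return index // 8, 1 << (index % 8)

    def __getitem__(self, index: int) -> bool:
        byte, mask = self._locate(index)
        return bool(self._data[byte] & mask)

    def __setitem__(self, index: int, value: bool) -> None:
        byte, mask = self._locate(index)
        if value:
            self._data[byte] |= mask
        else:
            self._data[byte] &= ~mask & 0xFF

    def toggle(self, index: int) -> None:
        # XOR with the mask flips exactly one bit.
        byte, mask = self._locate(index)
        self._data[byte] ^= mask

    def count(self, value: bool = True) -> int:
        ones = sum(bin(b).count("1") for b in self._data)
        return ones if value else self._length - ones


bits = PackedBits(10)
bits[0] = bits[9] = True
bits.toggle(3)
print(bits.count(), bits.count(False))  # 3 7
```

Ten booleans fit in two bytes here, which is the point of a packed representation.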
-
KeyHash
-
class
pyhmmer.easel.
KeyHash
¶ A dynamically resized container to store byte keys using a hash table.
Internally uses Bob Jenkins’ one at a time hash, a simple and efficient hash function published in 1997 that exhibits avalanche behaviour.
Example
Add new keys to the key hash using the add method, like you would with a Python set:
>>> kh = KeyHash()
>>> kh.add(b"key")
0
Check if a key hash contains a given key:
>>> b"key" in kh
True
>>> b"missing" in kh
False
Get the index associated with a key using the indexing notation:
>>> kh[b"key"]
0
>>> kh[b"missing"]
Traceback (most recent call last):
...
KeyError: b'missing'
See also
The Wikipedia article for Bob Jenkins’ hash functions: https://en.wikipedia.org/wiki/Jenkins_hash_function
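For reference, the one-at-a-time hash mentioned above can be sketched in a few lines of Python. This follows the published algorithm on 32-bit unsigned integers; how Easel reduces the resulting value to a slot in its table is not covered here, and the function name is chosen for illustration only.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic


def one_at_a_time(key: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash of a byte string."""
    h = 0
    for byte in key:
        # Mix each input byte into the running state.
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    # Final avalanche steps.
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h


print(hex(one_at_a_time(b"a")))  # 0xca2e9442
```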
-
__init__
()¶ Create a new empty key-hash collection.
-
add
(item)¶ Add a new key to the hash table, and return its index.
If key was already in the hash table, the previous index is returned:
>>> kh = KeyHash()
>>> kh.add(b"first")
0
>>> kh.add(b"second")
1
>>> kh.add(b"first")
0
- Parameters
key (bytes) – The key to add to the hash table.
- Returns
int – The index corresponding to the added key.
New in version 0.3.0.
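The indexing behaviour of add can be mimicked with a plain dictionary. The KeyIndex class below is a hypothetical stand-in that only reproduces the documented semantics (stable indices, assigned in insertion order); it says nothing about how the underlying C hash table is implemented.

```python
class KeyIndex:
    """Hypothetical stand-in reproducing the key-to-index semantics."""

    def __init__(self) -> None:
        self._indices = {}

    def add(self, key: bytes) -> int:
        # setdefault keeps the index assigned the first time a key was seen;
        # a new key receives the current size of the table as its index.
        return self._indices.setdefault(key, len(self._indices))

    def __contains__(self, key: bytes) -> bool:
        return key in self._indices

    def __getitem__(self, key: bytes) -> int:
        return self._indices[key]

    def __len__(self) -> int:
        return len(self._indices)


kh = KeyIndex()
indices = [kh.add(b"first"), kh.add(b"second"), kh.add(b"first")]
print(indices)  # [0, 1, 0]
```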
-
clear
()¶ Remove all entries from the collection.
-
copy
()¶ Create and return an exact copy of this mapping.
Example
>>> kh = KeyHash()
>>> kh.add(b"key")
0
>>> copy = kh.copy()
>>> b"key" in copy
True
-
Multiple Sequence Alignment
-
class
pyhmmer.easel.
MSA
¶ An abstract alignment of multiple sequences.
Hint
Use len(msa) to get the number of columns in the alignment, and len(msa.sequences) to get the number of sequences (i.e. the number of rows).
-
checksum
()¶ Calculate a 32-bit checksum for the multiple sequence alignment.
-
write
(fh, format)¶ Write the multiple sequence alignment to a file handle.
- Parameters
New in version 0.3.0.
-
-
class
pyhmmer.easel.
TextMSA
(MSA)¶ A multiple sequence alignment stored in text mode.
-
__init__
(name=None, description=None, accession=None, sequences=None, author=None)¶ Create a new text-mode alignment with the given sequences.
- Parameters
name (bytes, optional) – The name of the alignment, if any.
description (bytes, optional) – The description of the alignment, if any.
accession (bytes, optional) – The accession of the alignment, if any.
sequences (iterable of TextSequence) – The sequences to store in the multiple sequence alignment. All sequences must have the same length. They also need to have distinct names.
author (bytes, optional) – The author of the alignment, often used to record the aligner it was created with.
- Raises
ValueError – When the alignment cannot be created from the given sequences.
TypeError – When sequences is not an iterable of TextSequence objects.
Example
>>> s1 = TextSequence(name=b"seq1", sequence="ATGC")
>>> s2 = TextSequence(name=b"seq2", sequence="ATGC")
>>> msa = TextMSA(name=b"msa", sequences=[s1, s2])
>>> len(msa)
4
Changed in version 0.3.0: Allow creating an alignment from an iterable of TextSequence.
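The constraints on sequences stated above (equal lengths, distinct names) can be sketched as a standalone check. The check_alignment helper is hypothetical and works on plain (name, sequence) pairs; it is not how TextMSA validates its input internally, and the error messages are made up for the example.

```python
def check_alignment(rows):
    """Validate (name, sequence) pairs and return (n_sequences, n_columns)."""
    rows = list(rows)
    names = [name for name, _ in rows]
    if len(set(names)) != len(names):
        raise ValueError("sequence names must be distinct")
    lengths = {len(sequence) for _, sequence in rows}
    if len(lengths) > 1:
        raise ValueError("all sequences must have the same length")
    columns = lengths.pop() if lengths else 0
    return len(rows), columns


shape = check_alignment([(b"seq1", "ATGC"), (b"seq2", "ATGC")])
print(shape)  # (2, 4)

# Two rows sharing a name are rejected.
try:
    check_alignment([(b"seq1", "ATGC"), (b"seq1", "ATGA")])
    error = None
except ValueError as exc:
    error = str(exc)
print(error)  # sequence names must be distinct
```

The returned pair corresponds to len(msa.sequences) and len(msa) respectively.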
-
copy
()¶ Duplicate the text sequence alignment, and return the copy.
-
digitize
(alphabet)¶ Convert the text alignment to a digital alignment using alphabet.
- Returns
DigitalMSA – An alignment in digital mode containing the same sequences digitized with alphabet.
-
sequences
¶ A view of the sequences in the alignment.
This property lets you access the individual sequences in the multiple sequence alignment as TextSequence instances.
Example
Query the number of sequences in the alignment with len, or access individual members via indexing notation:
>>> s1 = TextSequence(name=b"seq1", sequence="ATGC")
>>> s2 = TextSequence(name=b"seq2", sequence="ATGC")
>>> msa = TextMSA(name=b"msa", sequences=[s1, s2])
>>> len(msa.sequences)
2
>>> msa.sequences[0].name
b'seq1'
Caution
Sequences in the list are copies, so editing their attributes will have no effect on the alignment:
>>> msa.sequences[0].name
b'seq1'
>>> msa.sequences[0].name = b"seq1bis"
>>> msa.sequences[0].name
b'seq1'
Support for this feature will be added in a future version, but the limitation can be circumvented for now by explicitly assigning the updated object back to its index:
>>> seq = msa.sequences[0]
>>> seq.name = b"seq1bis"
>>> msa.sequences[0] = seq
>>> msa.sequences[0].name
b'seq1bis'
New in version 0.3.0.
- Type
_TextMSASequences
-
-
class
pyhmmer.easel.
DigitalMSA
(MSA)¶ A multiple sequence alignment stored in digital mode.
-
__init__
(alphabet, name=None, description=None, accession=None, sequences=None, author=None)¶ Create a new digital-mode alignment with the given sequences.
- Parameters
alphabet (Alphabet) – The alphabet of the aligned sequences.
name (bytes, optional) – The name of the alignment, if any.
description (bytes, optional) – The description of the alignment, if any.
accession (bytes, optional) – The accession of the alignment, if any.
sequences (iterable of DigitalSequence) – The sequences to store in the multiple sequence alignment. All sequences must have the same length and alphabet. They also need to have distinct names set.
author (bytes, optional) – The author of the alignment, often used to record the aligner it was created with.
Changed in version 0.3.0: Allow creating an alignment from an iterable of DigitalSequence.
-
copy
()¶ Duplicate the digital sequence alignment, and return the copy.
-
textize
()¶ Convert the digital alignment to a text alignment.
- Returns
TextMSA
– A copy of the alignment in text-mode.
New in version 0.3.0.
-
sequences
¶ A view of the sequences in the alignment.
This property lets you access the individual sequences in the multiple sequence alignment as DigitalSequence instances.
See also
The documentation for the TextMSA.sequences property, which contains some additional information.
New in version 0.3.0.
- Type
_DigitalMSASequences
-
Sequence
-
class
pyhmmer.easel.
Sequence
¶ An abstract biological sequence with some associated metadata.
Easel provides two different modes to store a sequence: text or digital. In the HMMER code, changing from one mode to the other is done in place, which allows recycling memory. However, doing so can be confusing, since there is no way to know the representation of a sequence statically.
To avoid this, pyhmmer provides two subclasses of the Sequence abstract class to maintain the mode contract: TextSequence and DigitalSequence. Functions expecting sequences in digital format, like pyhmmer.hmmsearch, can then use the Python type system to make sure they receive sequences in the right mode. This allows type checkers such as mypy to detect potential contract breaches statically, before the code runs.
-
checksum
()¶ Calculate a 32-bit checksum for the sequence.
-
clear
()¶ Reinitialize the sequence for re-use.
-
copy
()¶ Duplicate the sequence, and return the copy.
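The mode contract described for the Sequence class hierarchy can be pictured with a small, self-contained sketch. All names below (Seq, TextSeq, DigitalSeq, search) are hypothetical placeholders rather than pyhmmer classes; the point is only that encoding the mode in the type lets both a static checker and a run-time check reject the wrong representation.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    """Hypothetical abstract base: metadata shared by both modes."""
    name: bytes


@dataclass
class TextSeq(Seq):
    letters: str


@dataclass
class DigitalSeq(Seq):
    codes: bytes


def search(query: DigitalSeq) -> int:
    """Stand-in for a function that only accepts digital-mode input."""
    if not isinstance(query, DigitalSeq):
        raise TypeError(f"expected DigitalSeq, got {type(query).__name__}")
    return len(query.codes)


hits = search(DigitalSeq(name=b"seq1", codes=bytes([0, 3, 2, 1])))

# Passing a text-mode sequence is flagged by a static type checker because
# of the annotation, and is also rejected at run time by the isinstance check.
try:
    search(TextSeq(name=b"seq1", letters="ATGC"))
    rejected = False
except TypeError:
    rejected = True
```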
-
-
class
pyhmmer.easel.
TextSequence
(Sequence)¶ A biological sequence stored in text mode.
Hint
Use the sequence property to access the sequence letters as a Python string.
-
__init__
(name=None, description=None, accession=None, sequence=None, source=None)¶ Create a new text-mode sequence with the given attributes.
-
copy
()¶ Duplicate the text sequence, and return the copy.
-
digitize
(alphabet)¶ Convert the text sequence to a digital sequence using alphabet.
- Returns
DigitalSequence – A copy of the sequence in digital mode, digitized with alphabet.
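Conceptually, digitizing replaces each letter with its index in the alphabet, and textizing reverses the mapping. The sketch below assumes the symbol order of Easel's default DNA alphabet; the digitize and textize functions here are free-standing illustrations, not the pyhmmer methods, and they ignore details such as how Easel delimits digital sequences in memory.

```python
DNA_SYMBOLS = "ACGT-RYMKSWHBVDN*~"  # assumed order of the default DNA alphabet


def digitize(text: str, symbols: str = DNA_SYMBOLS) -> bytes:
    """Encode each letter as its index in the alphabet."""
    try:
        return bytes(symbols.index(letter) for letter in text)
    except ValueError:
        raise ValueError(f"invalid symbol in {text!r}") from None


def textize(codes: bytes, symbols: str = DNA_SYMBOLS) -> str:
    """Decode alphabet indices back to letters."""
    return "".join(symbols[code] for code in codes)


codes = digitize("ATGC")
print(list(codes))     # [0, 3, 2, 1]
print(textize(codes))  # ATGC
```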
-
reverse_complement
(inplace=False)¶ Build the reverse complement of the sequence.
This method assumes that the sequence alphabet is IUPAC/DNA. If the sequence contains any unknown letters, they will be replaced by N in the reverse-complement.
- Parameters
inplace (bool) – Whether to reverse-complement the sequence in place instead of working on a copy. With False (the default), the method will return a copy of the sequence that has been reverse-complemented. With True, it will reverse-complement in place and return None.
- Raises
UserWarning – When the sequence contains unknown characters.
Example
>>> seq = TextSequence(sequence="ATGC")
>>> seq.reverse_complement().sequence
'GCAT'
Caution
The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!
New in version 0.3.0.
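The behaviour described for text-mode reverse complementation (IUPAC/DNA letters, unknown letters replaced by N with a warning) can be approximated in pure Python. This is only a sketch of the idea on uppercase strings; the function below is hypothetical and does not reproduce the Easel implementation.

```python
import warnings

# IUPAC DNA complement table, including degenerate codes.
_IUPAC = "ACGTRYMKSWHBVDN"
_COMPLEMENT = str.maketrans(_IUPAC, "TGCAYRKMSWDVBHN")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of an uppercase IUPAC DNA string."""
    if any(letter not in _IUPAC for letter in sequence):
        warnings.warn("sequence contains unknown characters", UserWarning)
        # Unknown letters are replaced by N before complementing.
        sequence = "".join(c if c in _IUPAC else "N" for c in sequence)
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # GCAT
```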
-
-
class
pyhmmer.easel.
DigitalSequence
(Sequence)¶ A biological sequence stored in digital mode.
Hint
Use the sequence property to access the sequence digits as a memory view, which gives access to the individual bytes. This can be combined with numpy.asarray to get the sequence as an array without copying.
-
__init__
(alphabet, name=None, description=None, accession=None, sequence=None, source=None)¶ Create a new digital-mode sequence with the given attributes.
New in version 0.1.4.
-
copy
()¶ Duplicate the digital sequence, and return the copy.
-
reverse_complement
(inplace=False)¶ Build the reverse complement of the sequence.
- Parameters
inplace (bool) – Whether to reverse-complement the sequence in place instead of working on a copy. With False (the default), the method will return a copy of the sequence that has been reverse-complemented. With True, it will reverse-complement in place and return None.
- Raises
ValueError – When the alphabet of the DigitalSequence does not have a complement mapping set (e.g., Alphabet.amino).
Caution
The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!
New in version 0.3.0.
-
textize
()¶ Convert the digital sequence to a text sequence.
- Returns
TextSequence
– A copy of the sequence in text-mode.
New in version 0.1.4.
-
sequence
¶ The raw sequence digits, as a memory view.
- Type
-
Sequence File
-
class
pyhmmer.easel.
SequenceFile
¶ A wrapper around a sequence file, containing unaligned sequences.
This class supports reading sequences stored in different formats, such as FASTA, GenBank or EMBL. The format of each file can be automatically detected, but it is also possible to pass an explicit format specifier when the SequenceFile is instantiated.
New in version 0.2.0: The alphabet attribute.
-
__init__
(file, format=None)¶ Create a new sequence file parser wrapping the given file.
-
close
()¶ Close the file and free the resources used by the parser.
-
guess_alphabet
()¶ Guess the alphabet of an open SequenceFile.
This method tries to guess the alphabet of a sequence file by inspecting the first sequence in the file. It returns the alphabet, or None if the file alphabet cannot be reliably guessed.
- Raises
EOFError – If the file is empty.
OSError – If a parse error occurred.
ValueError – If this method is called after the file was closed.
-
parse
(buffer, format)¶ Parse a sequence from a binary buffer using the given format.
-
parseinto
(seq, buffer, format)¶ Parse a sequence from a binary buffer into seq.
-
read
(skip_info=False, skip_sequence=False)¶ Read the next sequence from the file.
- Parameters
- Returns
Sequence – The next sequence in the file, or None if all sequences were read from the file.
- Raises
ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Hint
This method allocates a new sequence, which is inefficient when sequences are read in a tight loop. If you can, use SequenceFile.readinto with an already initialized Sequence to recycle the internal buffers.
-
readinto
(seq, skip_info=False, skip_sequence=False)¶ Read the next sequence from the file, using seq to store data.
- Parameters
seq (Sequence) – A sequence object to use to store the next entry in the file. If this sequence was used before, it must be properly reset (using the Sequence.clear method) before using it again with readinto.
skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.
skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.
- Returns
Sequence – A reference to seq that was passed as an input, or None if no sequences are left in the file.
- Raises
ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Example
Use SequenceFile.readinto to loop over the sequences in a file while recycling the same Sequence buffer:
>>> with SequenceFile("vendor/hmmer/testsuite/ecori.fa") as sf:
...     seq = TextSequence()
...     while sf.readinto(seq) is not None:
...         # ... process seq here ... #
...         seq.clear()
-
set_digital
(alphabet)¶ Set the SequenceFile to read in digital mode with alphabet.
This method can be called even after the first sequences have been read; it only affects subsequent sequences in the file.
-
Sequence / Subsequence Index
-
class
pyhmmer.easel.
SSIReader
¶ A read-only handler for a sequence/subsequence index file.
-
class
Entry
(fd, record_offset, data_offset, record_length)¶ -
property
data_offset
¶ Alias for field number 2
-
property
fd
¶ Alias for field number 0
-
property
record_length
¶ Alias for field number 3
-
property
record_offset
¶ Alias for field number 1
-
class
FileInfo
(name, format)¶ -
property
format
¶ Alias for field number 1
-
property
name
¶ Alias for field number 0
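The "Alias for field number N" descriptions are the documentation that Python generates automatically for named tuple fields, so Entry and FileInfo behave like ordinary named tuples. The sketch below recreates equivalent types with collections.namedtuple to show what that implies; the field values are made up for illustration.

```python
from collections import namedtuple

# Stand-ins with the same field order as the documented nested classes.
Entry = namedtuple("Entry", ["fd", "record_offset", "data_offset", "record_length"])
FileInfo = namedtuple("FileInfo", ["name", "format"])

# Hypothetical values, only to demonstrate field access.
entry = Entry(fd=0, record_offset=128, data_offset=140, record_length=60)
info = FileInfo(name="sequences.fa", format=0)

# Fields are reachable by name or by position ("field number").
print(entry.data_offset, entry[2])  # 140 140

# Named tuples also support unpacking in field order.
fd, record_offset, data_offset, record_length = entry
```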
-
__init__
(file)¶ Create a new SSI file reader for the file at the given location.
- Parameters
file (
str
) – The path to a sequence/subsequence index file to read.
-
close
()¶ Close the SSI file reader.
-
class
pyhmmer.easel.
SSIWriter
¶ A writer for sequence/subsequence index files.
-
__init__
(file)¶ Create a new SSI file writer for the file at the given location.
- Parameters
- Raises
FileNotFoundError – When the path to the file cannot be resolved.
FileExistsError – When the file exists and exclusive is True.
-
add_alias
(alias, key)¶ Make alias an alias of key in the index.
-
add_file
(filename, format=0)¶ Add a new file to the index.
-
add_key
(key, fd, record_offset, data_offset=0, record_length=0)¶ Add a new entry to the index with the given key.
-
close
()¶ Close the SSI file writer.
-