Easel¶

High-level interface to the Easel C library.

Easel is a library developed by the Eddy/Rivas Lab to facilitate the development of biological software in C. It is used by HMMER and Infernal.

Data Structures¶

Bitfield¶

class pyhmmer.easel.Bitfield¶

A statically sized sequence of booleans stored as a packed bitfield.

Example

Instantiate a bitfield from an iterable, where each object will be tested for truth:

>>> bitfield = Bitfield([True, False, False])
>>> len(bitfield)
3
>>> bitfield[0]
True
>>> bitfield[1]
False

Use Bitfield.zeros and Bitfield.ones to initialize a bitfield of a given length with all fields set to \(0\) or \(1\):

>>> Bitfield.zeros(4)
Bitfield([False, False, False, False])
>>> Bitfield.ones(4)
Bitfield([True, True, True, True])

Use indexing to access and edit individual bits:

>>> bitfield[0] = True
>>> bitfield[0]
True
>>> bitfield[0] = False
>>> bitfield[0]
False

__copy__()¶: None

__deepcopy__(memo)¶: None

__getstate__()¶: None

__init__()¶

Create a new bitfield from an iterable of objects.

Objects yielded by the iterable can be of any type and will be tested for truth before setting the corresponding field.

Raises:: ValueError – When given an empty iterable.

__reduce_ex__(protocol)¶: None

__setstate__(state)¶: None

__sizeof__()¶: None

classmethod _from_raw_bytes(buffer, n, byteorder)¶: None

copy()¶: Return a copy of this bitfield object.

Added in version 0.7.0.

count(value=1)¶

Count the number occurrences of value in the bitfield.

If no argument is given, counts the number of True occurences.

Example

>>> bitfield = Bitfield.zeros(8)
>>> bitfield.count(False)
8
>>> bitfield[0] = bitfield[1] = True
>>> bitfield.count()
2

classmethod ones(n)¶: Create a new bitfield of size n with all elements set to True.

Added in version 0.7.0.

toggle(index)¶

Switch the value of one single bit.

Example

>>> bitfield = Bitfield.zeros(8)
>>> bitfield[0]
False
>>> bitfield.toggle(0)
>>> bitfield[0]
True
>>> bitfield.toggle(0)
>>> bitfield[0]
False

classmethod zeros(n)¶: Create a new bitfield of size n with all elements set to False.

Added in version 0.7.0.

KeyHash¶

class pyhmmer.easel.KeyHash¶

A dynamically resized container to store byte keys using a hash table.

Internally uses Bob Jenkins’ one at a time hash, a simple and efficient hash function published in 1997 that exhibits avalanche behaviour.

Example

Add new keys to the key hash using the add method like you would with a Python set:

>>> kh = KeyHash()
>>> kh.add(b"key")
0

Check if a key hash contains a given key:

>>> b"key" in kh
True
>>> b"missing" in kh
False

Get the index associated with a key using the indexing notation:

>>> kh[b"key"]
0
>>> kh[b"missing"]
Traceback (most recent call last):
  ...
KeyError: b'missing'

Iterate over the keys of the key hash, in the order of insertion:

>>> kh.add(b"key2")
1
>>> for k in kh:
...     print(k)
b'key'
b'key2'

See also

The Wikipedia article for Bob Jenkins’ hash functions: https://en.wikipedia.org/wiki/Jenkins_hash_function

__copy__()¶: None

__getstate__()¶: None

__init__()¶: Create a new empty key-hash collection.

__setstate__(state)¶: None

__sizeof__()¶: None

add(key)¶

Add a new key to the hash table, and return its index.

If key was already in the hash table, the previous index is returned:

>>> kh = KeyHash()
>>> kh.add(b"first")
0
>>> kh.add(b"second")
1
>>> kh.add(b"first")
0

Parameters:: key (bytes) – The key to add to the hash table.
Returns:: int – The index corresponding to the added key.

Added in version 0.3.0.

clear()¶: Remove all entries from the collection.

copy()¶

Create and return an exact copy of this mapping.

Example

>>> kh = KeyHash()
>>> kh.add(b"key")
0
>>> copy = kh.copy()
>>> b"key" in copy
True

Sequences¶

Sequence¶

class pyhmmer.easel.Sequence¶

An abstract biological sequence with some associated metadata.

Easel provides two different mode to store a sequence: text, or digital. In the HMMER code, changing from one mode to another mode is done in place, which allows recycling memory. However, doing so can be confusing since there is no way to know statically the representation of a sequence.

To avoid this, pyhmmer provides two subclasses of the Sequence abstract class to maintain the mode contract: TextSequence and DigitalSequence. Functions expecting sequences in digital format, like pyhmmer.hmmer.hmmsearch, can then use Python type system to make sure they receive sequences in the right mode. This allows type checkers such as mypy to detect potential contract breaches at compile-time.

__copy__()¶: None

__sizeof__()¶: None

checksum()¶: Calculate a 32-bit checksum for the sequence.

clear()¶: Reinitialize the sequence for re-use.

copy()¶: Duplicate the sequence, and return the copy.

write(fh)¶

Write the sequence alignement to a file handle, in FASTA format.

Parameters:: fh (io.IOBase) – A Python file handle, opened in binary mode.

Added in version 0.3.0.

accession¶

The accession of the sequence.

Type:: bytes

description¶

The description of the sequence.

Type:: bytes

name¶

The name of the sequence.

Type:: bytes

residue_markups¶

Extra residue markups, mapping information to each position.

Keys and values are not decoded, since they are not necessarily valid UTF-8 bytestrings.

Caution

The values of the dictionary must be the same size as the sequence itself. Trying to set a residue markup of the wrong length will raise a ValueError:

>>> seq = TextSequence(sequence="TTAATTGGT")
>>> seq.residue_markups = {b"quality": b"efcfffffcfee"}
Traceback (most recent call last):
  ...
ValueError: Residue markup annotation has an invalid length (expected 9, got 12)

Added in version 0.4.6.

Type:: dict

source¶

The source of the sequence, if any.

Type:: bytes

TextSequence¶

class pyhmmer.easel.TextSequence(Sequence)¶

A biological sequence stored in text mode.

Hint

Use the sequence property to access the sequence letters as a Python string.

Added in version 0.10.4: pickle protocol support.

__init__()¶: Create a new text-mode sequence with the given attributes.

Added in version 0.10.4: The residue_markups argument.

__reduce__()¶: None

copy()¶: Duplicate the text sequence, and return the copy.

digitize(alphabet)¶

Convert the text sequence to a digital sequence using alphabet.

Returns:: DigitalSequence – A copy of the sequence in digital mode, digitized with alphabet.

reverse_complement(inplace=False)¶

Build the reverse complement of the sequence.

This method assumes that the sequence alphabet is IUPAC/DNA. If the sequence contains any unknown letters, they will be replaced by \(N\) in the reverse-complement.

Parameters:: inplace (bool) – Whether or not to copy the sequence before computing its reverse complement. With False (the default), the method will return a copy of the sequence that has been reverse-complemented. With True, it will reverse-complement inplace and return None.
Raises:: UserWarning – When the sequence contains unknown characters.

Example

>>> seq = TextSequence(sequence="ATGC")
>>> seq.reverse_complement().sequence
'GCAT'

Caution

The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!

Added in version 0.3.0.

sequence¶

The raw sequence letters, as a Python string.

Type:: str

DigitalSequence¶

class pyhmmer.easel.DigitalSequence(Sequence)¶

A biological sequence stored in digital mode.

alphabet¶

The biological alphabet used to encode this sequence to digits.

Type:: Alphabet, readonly

Hint

Use the sequence property to access the sequence digits as a memory view, allowing to access the individual bytes. This can be combined with numpy.asarray to get the sequence as an array with zero-copy.

Added in version 0.10.4: pickle protocol support.

__init__()¶

Create a new digital-mode sequence with the given attributes.

Raises:: ValueError – When sequence contains digits outside the alphabet symbol range.

Added in version 0.1.4.

Added in version 0.10.4: The residue_markups argument.

__reduce__()¶: None

copy()¶: Duplicate the digital sequence, and return the copy.

reverse_complement(inplace=False)¶

Build the reverse complement of the sequence.

Parameters:: inplace (bool) – Whether or not to copy the sequence before computing its reverse complement. With False (the default), the method will return a copy of the sequence that has been reverse-complemented. With True, it will reverse-complement inplace and return None.
Raises:: ValueError – When the alphabet of the DigitalSequence does not have a complement mapping set (e.g., Alphabet.amino).

Caution

The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!

Added in version 0.3.0.

textize()¶

Convert the digital sequence to a text sequence.

Returns:: TextSequence – A copy of the sequence in text-mode.

Added in version 0.1.4.

translate(genetic_code=GeneticCode(1))¶

Translate the sequence using the given genetic code.

Parameters:

genetic_code (pyhmmer.easel.GeneticCode) – The genetic code to use for translating the sequence. If none provided, the default uses the standard translation table (1) and expects DNA sequences.

Returns:

pyhmmer.easel.DigitalSequence – The translation of the input sequence, in digital mode.

Raises:

pyhmmer.errors.AlphabetMismatch – When the genetic_code expects a different nucleotide alphabet than the one currently in use to encode the sequence.
ValueError – When sequence could not be translated properly, because of a codon could not be recognized, or because the sequence has an invalid length.

Note

The translation of a DNA/RNA codon supports ambiguous codons. If the amino acid is unambiguous, despite codon ambiguity, the correct amino acid is still determined: GGR translates as Gly, UUY as Phe, etc. If there is no single unambiguous amino acid translation, the codon is translated as X. Ambiguous amino acids (such as J or B) are never produced.

Added in version 0.7.2.

sequence¶

The raw sequence digits, as a byte vector.

Note

The internal ESL_SQ object allocates a buffer of size \(n+2\) (where \(n\) is the number of residues in the sequence), with the first and the last element of the buffer being sentinel values. This vector does not expose the sentinel values, only the \(n\) elements of the buffer in between.

Changed in version 0.4.0: Property is now a VectorU8 instead of a memoryview.

Type:: VectorU8

Sequence Blocks¶

SequenceBlock¶

class pyhmmer.easel.SequenceBlock¶

An abstract container for storing Sequence objects.

To pass the target sequences efficiently in Pipeline.search_hmm, an array is allocated so that the inner loop can iterate over the target sequences without having to acquire the GIL for each new sequence (this gave a huge performance boost in v0.4.5). However, there was no way to reuse this between different queries; some memory recycling was done, but the target sequences had to be indexed for every query. This class allows synchronizing a Python list of Sequence objects with an internal C-contiguous buffer of pointers to ESL_SQ structs that can be used in the HMMER search loop.

Added in version 0.7.0.

__copy__()¶: None

__reduce__()¶: None

__sizeof__()¶: None

clear()¶: Remove all sequences from the block.

copy()¶: Return a copy of the sequence block.

Note

The sequence internally refered to by this collection are not copied. Use copy.deepcopy if you also want to duplicate the internal storage of each sequence.

extend(iterable)¶: Extend block by appending sequences from the iterable.

largest()¶: Return the largest sequence in the block.

TextSequenceBlock¶

class pyhmmer.easel.TextSequenceBlock(SequenceBlock)¶

A container for storing TextSequence objects.

Added in version 0.7.0.

Added in version 0.10.4: pickle protocol support.

__init__(*args, **kwargs)¶

__reduce__()¶: None

append(sequence)¶: Append sequence at the end of the block.

copy()¶: Return a copy of the text sequence block.

Note

The sequence internally refered to by this collection are not copied. Use copy.deepcopy is you also want to duplicate the internal storage of each sequence.

digitize(alphabet)¶: Create a block containing sequences from this block in digital mode.

index(sequence, start=0, stop=9223372036854775807)¶

Return the index of the first occurence of sequence.

Raises:: ValueError – When the block does not contain sequence.

insert(index, sequence)¶: Insert a new sequence in the block before index.

largest()¶: Return the largest sequence in the block.

pop(index=-1)¶: Remove and return a sequence from the block (the last one by default).

remove(sequence)¶: Remove the first occurence of the given sequence.

DigitalSequenceBlock¶

class pyhmmer.easel.DigitalSequenceBlock(SequenceBlock)¶

A container for storing DigitalSequence objects.

alphabet¶

The biological alphabet shared by all sequences in the collection.

Type:: Alphabet, readonly

Added in version 0.7.0.

Added in version 0.10.4: pickle protocol support.

__init__()¶

Create a new digital sequence block with the given alphabet.

Parameters:

alphabet (Alphabet) – The alphabet to use for all the sequences in the block.
iterable (iterable of DigitalSequence) – An initial collection of digital sequences to add to the block.

Raises:

AlphabetMismatch – When the alphabet of one of the sequences does not match alphabet.

__reduce__()¶: None

append(sequence)¶: Append sequence at the end of the block.

copy()¶: Return a copy of the digital sequence block.

Note

The sequence internally refered to by this collection are not copied. Use copy.deepcopy is you also want to duplicate the internal storage of each sequence.

index(sequence, start=0, stop=9223372036854775807)¶

Return the index of the first occurence of sequence.

Raises:: ValueError – When the block does not contain sequence.

insert(index, sequence)¶: Insert a new sequence in the block before index.

largest()¶: None

pop(index=-1)¶: Remove and return a sequence from the block (the last one by default).

remove(sequence)¶: Remove the first occurence of the given sequence.

textize()¶: Create a block containing sequences from this block in text mode.

translate(genetic_code=GeneticCode(1))¶

Translate the sequence block using the given genetic code.

Parameters:

Returns:

pyhmmer.easel.DigitalSequenceBlock – The translation of each sequence from the block, in digital mode.

Raises:

pyhmmer.errors.AlphabetMismatch – When the genetic_code expects a different nucleotide alphabet than the one currently for the sequences in the block.
ValueError – When a sequence from the block could not be translated properly, because of a codon could not be recognized, or because the sequence has an invalid length.

See also

pyhmmer.easel.DigitalSequence.translate for more information on how ambiguous nucleotides are handled.

SequenceFile¶

class pyhmmer.easel.SequenceFile¶

A wrapper around a sequence file, containing unaligned sequences.

This class supports reading sequences stored in different formats, such as FASTA, GenBank or EMBL. The format of each file can be automatically detected, but it is also possible to pass an explicit format specifier when the SequenceFile is instantiated.

Hint

SequenceFile objects can also be used to parse files containing multiple sequence alignments: in that case, the sequences will be read sequentially, removing the gap characters:

>>> with SequenceFile("tests/data/msa/LuxC.sto") as sf:
...     sequences = sf.read_block()
>>> print(sequences[0].name[:6], sequences[0].sequence[:30])
b'Q9KV99' LANQPLEAILGLINEARKSWSSTPELDPYR
>>> print(sequences[1].name[:6], sequences[1].sequence[:30])
b'Q2WLE3' IYSYPSEAMIEIINEYSKILCSDRKFLSYE

Added in version 0.2.0: The alphabet attribute.

Changed in version 0.4.8: Support reading sequences from a file-like handle. Support reading individual sequences from an MSA file.

__enter__()¶: None

__exit__(exc_type, exc_value, traceback)¶: None

__init__()¶

Create a new sequence file parser wrapping the given file.

Parameters:

file (str or file-like object) – Either the path to a file containing the sequences to read, or a file-like object opened in binary mode.
format (str, optional) – The format of the file, or None to autodetect. Supported values are: fasta, embl, genbank, ddbj, uniprot, ncbi, daemon, hmmpgmd, fmindex, plus any format also supported by MSAFile.
digital (bool) – Whether to read the sequences in text or digital mode. This will affect the type of Sequence objects returned later by the read function.
alphabet (Alphabet) – The alphabet to use to digitize the sequences while reading. If None given, it will be guessed based on the contents of the first sequence.

Raises:

ValueError – When format is not a valid sequence format.
OSError – If an internal parser error occurred while guessing the alphabet or the format.

Caution

SequenceFile can generally read sequences from binary-mode file-like objects, except for sequences in an NCBI BLAST database, since it is composed of multiple files. Reading from an NCBI BLAST database passed from a filename is however supported.

Changed in version 0.4.4: Added the ignore_gaps parameter.

Changed in version 0.4.8: Support reading from a file-like object (except NCBI format).

Changed in version 0.5.0: Added the digital and alphabet keyword arguments.

Deprecated since version 0.6.0: The ignore_gaps keyword argument, use afa format instead.

Changed in version 0.8.0: Removed the ignore_gaps keyword argument.

close()¶: Close the file and free the resources used by the parser.

guess_alphabet()¶

Guess the alphabet of an open SequenceFile.

This method tries to guess the alphabet of a sequence file by inspecting the first sequence in the file. It returns the alphabet, or None if the file alphabet cannot be reliably guessed.

Raises:

EOFError – if the file is empty.
OSError – if a parse error occurred.
ValueError – if this methods is called on a closed file.

Example

>>> with SequenceFile("tests/data/seqs/bmyD.fna") as sf:
...     sf.guess_alphabet()
Alphabet.dna()
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     sf.guess_alphabet()
Alphabet.amino()

Added in version 0.6.3.

classmethod parse(buffer, format, *, alphabet=None)¶

Parse a sequence from a binary buffer using the given format.

Argument:

buffer (bytes or byte-like buffer): A buffer containing the: sequence data to parse. Any type implementing the buffer protocol (such as bytes, bytearray, or memoryview) is supported.
format (str): The format of the sequence data. See the: SequenceFile.__init__ documentation for allowed values.

Keyword Arguments:

alphabet (Alphabet) – The alphabet to use to digitize the returned sequence, if desired.

Returns:

Sequence – The sequenced parsed from the buffer, either as a DigitalSequence if an alphabet was provided, or as a TextSequence if None was given.

Raises:

ValueError – When format is not a valid sequence format.
OSError – If an internal parser error occurred while guessing the alphabet or the format.

classmethod parseinto(seq, buffer, format)¶

Parse a sequence from a binary buffer into seq.

Argument:

seq (Sequence): The sequence object into which: the deseriazlied sequence data will be written.
buffer (bytes or byte-like buffer): A buffer containing the: sequence data to parse. Any type implementing the buffer protocol (such as bytes, bytearray, or memoryview) is supported.
format (str): The format of the sequence data. See the: SequenceFile.__init__ documentation for allowed values.

Raises:

ValueError – When format is not a valid sequence format.
OSError – If an internal parser error occurred while guessing the alphabet or the format.

Returns:

Sequence – The sequence given as argument, or None if the end of the file was reached.

read(skip_info=False, skip_sequence=False)¶

Read the next sequence from the file.

Parameters:

skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.
skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.

Returns:

Sequence – The next sequence in the file, or None if all sequences were read from the file.

Raises:

ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.

Hint

This method allocates a new sequence, which is not efficient in case the sequences are being read within a tight loop. Use SequenceFile.readinto with an already initialized Sequence if you can to recycle the internal buffers.

read_block(sequences=None, residues=None)¶

Read several sequences into a sequence block.

Parameters:

sequences (int, optional) – The maximum number of sequences to read before returning a block. Leave as None to read all remaining sequences from the file.
residues (int, optional) – The number of residues to read before returning the block. Leave as None to keep reading sequences without a residue limit.

Returns:

SequenceBlock – A sequence block object, which may be empty if there are no sequences to read anymore. The concrete type depends on whether the SequenceFile was opened in text or digital mode.

Raises:

ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.

Example

Read a block of at most 4 sequences from a sequence file:

>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     block = sf.read_block(sequences=4)
>>> len(block)
4

Read sequences until the block contains at least 1000 residues:

>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     block = sf.read_block(residues=1000)
>>> len(block)
3
>>> len(block[0]) + len(block[1]) + len(block[2])
1444

Note that the last sequence will not be truncated, so the block will always contain more than max_residues unless the end of the file was reached.

Added in version 0.7.0.

readinto(seq, skip_info=False, skip_sequence=False)¶

Read the next sequence from the file, using seq to store data.

Parameters:

seq (Sequence) – A sequence object to use to store the next entry in the file. If this sequence was used before, it must be properly reset (using the Sequence.clear method) before using it again with readinto.
skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.
skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.

Returns:

Sequence – A reference to seq that was passed as an input, or None if no sequences are left in the file.

Raises:

ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.

Example

Use SequenceFile.readinto to loop over the sequences in a file while recycling the same Sequence buffer:

>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     seq = TextSequence()
...     while sf.readinto(seq) is not None:
...         # ... process seq here ... #
...         seq.clear()

rewind()¶

Rewind the file back to the beginning.

For sequential formats, this method is supported for both path-based and file object-based sequence files. For multiple-sequence alignment formats, the underlying MSAFile needs to be reopened, so this is only supported for path-based files.

Raises:: io.UnsupportedOperation – When attempting to rewind a sequence file where the underlying stream is a file-like object that does not support the seek method.

closed¶

Whether the SequenceFile is closed or not.

Type:: bool

digital¶

Whether the SequenceFile is in digital mode or not.

Added in version 0.5.0.

Type:: bool

format¶

The format of the SequenceFile.

Type:: str

Alignments¶

MSA¶

class pyhmmer.easel.MSA¶

An abstract alignment of multiple sequences.

Hint

Use len(msa) to get the number of columns in the alignment, and len(msa.sequences) to get the number of sequences (i.e. the number of rows).

__copy__()¶: None

checksum()¶: Calculate a 32-bit checksum for the multiple sequence alignment.

write(fh, format)¶

Write the multiple sequence alignement to a file handle.

Parameters:

fh (io.IOBase) – A Python file handle, opened in binary mode.
format (str) – The name of the multiple sequence alignment file format to use.

Added in version 0.3.0.

accession¶

The accession of the alignment, if any.

Type:: bytes or None

author¶

The author of the alignment, if any.

Type:: bytes or None

description¶

The description of the alignment, if any.

Type:: bytes or None

name¶

The name of the alignment, if any.

Type:: bytes or None

names¶

The name of each sequence in the alignment.

Every sequence in the alignment is required to have a name, so no member of the tuple will ever be None.

Example

>>> s1 = TextSequence(name=b"seq1", sequence="ATGC")
>>> s2 = TextSequence(name=b"seq2", sequence="ATGC")
>>> msa = TextMSA(name=b"msa", sequences=[s1, s2])
>>> msa.names
(b'seq1', b'seq2')

Added in version 0.4.8.

Type:: tuple of bytes

TextMSA¶

class pyhmmer.easel.TextMSA(MSA)¶

A multiple sequence alignement stored in text mode.

__init__()¶

Create a new text-mode alignment with the given sequences.

Parameters:

name (bytes, optional) – The name of the alignment, if any.
description (bytes, optional) – The description of the alignment, if any.
accession (bytes, optional) – The accession of the alignment, if any.
sequences (collection of TextSequence) – The sequences to store in the multiple sequence alignment. All sequences must have the same length. They also need to have distinct names.
author (bytes, optional) – The author of the alignment, often used to record the aligner it was created with.

Raises:

ValueError – When the alignment cannot be created from the given sequences.
TypeError – When sequences is not an iterable of TextSequence objects.

Example

>>> s1 = TextSequence(name=b"seq1", sequence="ATGC")
>>> s2 = TextSequence(name=b"seq2", sequence="ATGC")
>>> msa = TextMSA(name=b"msa", sequences=[s1, s2])
>>> len(msa)
4

Changed in version 0.3.0: Allow creating an alignment from an iterable of TextSequence.

copy()¶: Duplicate the text sequence alignment, and return the copy.

digitize(alphabet)¶

Convert the text alignment to a digital alignment using alphabet.

Returns:: DigitalMSA – An alignment in digital mode containing the same sequences digitized with alphabet.
Raises:: ValueError – When the text sequence contains invalid characters that cannot be converted according to alphabet.symbols.

alignment¶

A view of the aligned sequences as strings.

This property gives access to the aligned sequences, including gap characters, so that they can be displayed or processed column by column.

Examples

Use TextMSA.alignment to display an alignment in text format:

>>> for name, aligned in zip(luxc.names, luxc.alignment):
...     print(name, " ", aligned[:40], "...")
b'Q9KV99.1'   LANQPLEAILGLINEARKSWSST------------PELDP ...
b'Q2WLE3.1'   IYSYPSEAMIEIINEYSKILCSD------------RKFLS ...
b'Q97GS8.1'   VHDIKTEETIDLLDRCAKLWLDDNYSKK--HIETLAQITN ...
b'Q3WCI9.1'   LLNVPLKEIIDFLVETGERIRDPRNTFMQDCIDRMAGTHV ...
b'P08639.1'   LNDLNINNIINFLYTTGQRWKSEEYSRRRAYIRSLITYLG ...
...

Use the splat operator (*) in combination with the zip builtin to iterate over the columns of an alignment:

>>> for idx, col in enumerate(zip(*luxc.alignment)):
...     print(idx+1, col)
1 ('L', 'I', 'V', 'L', 'L', ...)
2 ('A', 'Y', 'H', 'L', 'N', ...)
...

Added in version 0.4.8.

Type:: tuple of str

sequences¶

A view of the sequences in the alignment.

This property lets you access the individual sequences in the multiple sequence alignment as TextSequence instances.

Example

Query the number of sequences in the alignment with len, or access individual members via indexing notation:

>>> s1 = TextSequence(name=b"seq1", sequence="ATGC")
>>> s2 = TextSequence(name=b"seq2", sequence="ATGC")
>>> msa = TextMSA(name=b"msa", sequences=[s1, s2])
>>> len(msa.sequences)
2
>>> msa.sequences[0].name
b'seq1'

Caution

Sequences in the list are copies, so editing their attributes will have no effect on the alignment:

>>> msa.sequences[0].name
b'seq1'
>>> msa.sequences[0].name = b"seq1bis"
>>> msa.sequences[0].name
b'seq1'

Support for this feature may be added in a future version, but can be circumvented for now by forcingly setting the updated version of the object:

>>> seq = msa.sequences[0]
>>> seq.name = b"seq1bis"
>>> msa.sequences[0] = seq
>>> msa.sequences[0].name
b'seq1bis'

Added in version 0.3.0.

Type:: _TextMSASequences

DigitalMSA¶

class pyhmmer.easel.DigitalMSA(MSA)¶

A multiple sequence alignment stored in digital mode.

alphabet¶

The biological alphabet used to encode this sequence alignment to digits.

Type:: Alphabet

__init__()¶

Create a new digital-mode alignment with the given sequences.

Parameters:

alphabet (Alphabet) – The alphabet of the alignmed sequences.
name (bytes, optional) – The name of the alignment, if any.
description (bytes, optional) – The description of the alignment, if any.
accession (bytes, optional) – The accession of the alignment, if any.
sequences (iterable of DigitalSequence) – The sequences to store in the multiple sequence alignment. All sequences must have the same length and alphabet. They also need to have distinct names set.
author (bytes, optional) – The author of the alignment, often used to record the aligner it was created with.

Changed in version 0.3.0: Allow creating an alignment from an iterable of DigitalSequence.

copy()¶: Duplicate the digital sequence alignment, and return the copy.

textize()¶

Convert the digital alignment to a text alignment.

Returns:: TextMSA – A copy of the alignment in text-mode.

Added in version 0.3.0.

sequences¶

A view of the sequences in the alignment.

This property lets you access the individual sequences in the multiple sequence alignment as DigitalSequence instances.

See also

The documentation for the TextMSA.sequences property, which contains some additional information.

Added in version 0.3.0.

Type:: _DigitalMSASequences

MSAFile¶

class pyhmmer.easel.MSAFile¶

A wrapper around a multiple-alignment file.

This class supports reading sequences stored in different formats, such as Stockholm, A2M, PSI-BLAST or Clustal.

name¶

The name of the MSA file, if it was created from a filename, or None if it wraps a file-like object.

Type:: str, optional

Hint

Some Clustal files created by alignment tools other than Clustal (such as MUSCLE or MAFFT, for instance), may not contain the header expected by Easel for the Clustal format. If you get an error while trying to parse these files, use the "clustallike" format instead of the "clustal" format when creating the MSAFile.

__enter__()¶: None

__exit__(exc_type, exc_value, traceback)¶: None

__init__()¶

Create a new MSA file parser wrapping the given file.

Parameters:

file (str or file-like object) – Either the path to a file containing the sequences to read, or a file-like object opened in binary mode.
format (str, optional) – The format of the file, or None to autodetect. Supported values are: stockholm, pfam, a2m, psiblast, selex, afa (aligned FASTA), clustal, clustallike, phylip, phylips.

Keyword Arguments:

digital (bool) – Whether to read the sequences in text or digital mode. This will affect the type of MSA objects returned later by the read function.
alphabet (Alphabet) – The alphabet to use to digitize the sequences while reading. If None given, it will be guessed based on the contents of the first sequence.

Raises:

ValueError – When format is not a valid MSA format.

Changed in version 0.4.8: Support reading from a file-like object.

Changed in version 0.5.0: Added the digital and alphabet keyword arguments.

close()¶: Close the file and free the resources used by the parser.

guess_alphabet()¶

Guess the alphabet of an open MSAFile.

This method tries to guess the alphabet of a multiple-alignment file by inspecting the first entry in the file. It returns the alphabet, or None if the file alphabet cannot be reliably guessed.

Raises:

EOFError – if the file is empty.
OSError – if a parse error occurred.
ValueError – if this methods is called after the file was closed.

Example

>>> with MSAFile("tests/data/msa/laccase.clw") as mf:
...     mf.guess_alphabet()
Alphabet.amino()

read()¶

Read the next alignment from the file.

Returns:: MSA – The next alignment in the file, or None if all the alignments were read from the file already.
Raises:: ValueError – When attempting to read an alignment from a closed file, or when the file could not be parsed.

closed¶

Whether the MSAFile is closed or not.

Type:: bool

digital¶

Whether the MSAFile is in digital mode or not.

Type:: bool

format¶

The format of the MSAFile.

Type:: str

Linear Algebra¶

Vector¶

class pyhmmer.easel.Vector¶

An abstract 1D array of fixed size.

Added in version 0.4.0.

__copy__()¶: None

__reduce_ex__(protocol)¶: None

__sizeof__()¶: None

classmethod _from_raw_bytes(buffer, n, byteorder)¶: None

argmax()¶

Return index of the maximum element in the vector.

Raises:: ValueError – When called on an empty vector.

argmin()¶

Return index of the minimum element in the vector.

Raises:: ValueError – When called on an empty vector.

copy()¶: Create a copy of the vector, allocating a new buffer.

max()¶

Return value of the maximum element in the vector.

Raises:: ValueError – When called on an empty vector.

min()¶

Return value of the minimum element in the vector.

Raises:: ValueError – When called on an empty vector.

reverse()¶: Reverse the vector, in place.

sum()¶: Returns the scalar sum of all elements in the vector.

classmethod zeros(n)¶: Create a vector of size n filled with zeros.

format¶

The format of each item in the vector.

See also

The array module of the Python standard library for a detail about available type codes.

Added in version 0.4.6.

Type:: str

itemsize¶

The size of each item in the vector, in bytes.

Added in version 0.4.6.

Type:: int

shape¶

The shape of the vector.

Type:: tuple

strides¶

The strides of the vector.

Type:: tuple

VectorF¶

class pyhmmer.easel.VectorF¶

A vector storing single-precision floating point numbers.

Individual elements of a vector can be accessed and modified with the usual indexing notation:

>>> v = VectorF([1.0, 2.0, 3.0])
>>> v[0]
1.0
>>> v[-1]
3.0
>>> v[0] = v[-1] = 4.0
>>> v
VectorF([4.0, 2.0, 4.0])

Slices are also supported, and they do not copy data (use the copy method to allocate a new vector):

>>> v = VectorF(range(6))
>>> v[2:5]
VectorF([2.0, 3.0, 4.0])
>>> v[2:-1] = 10.0
>>> v
VectorF([0.0, 1.0, 10.0, 10.0, 10.0, 5.0])

Addition and multiplication is supported for scalars, in place or not:

>>> v = VectorF([1.0, 2.0, 3.0])
>>> v += 1
>>> v
VectorF([2.0, 3.0, 4.0])
>>> v * 3
VectorF([6.0, 9.0, 12.0])

Pairwise operations can also be performed, but only on vectors of the same dimension and precision:

>>> v = VectorF([1.0, 2.0, 3.0])
>>> v * v
VectorF([1.0, 4.0, 9.0])
>>> v += VectorF([3.0, 4.0, 5.0])
>>> v
VectorF([4.0, 6.0, 8.0])
>>> v *= VectorF([1.0])
Traceback (most recent call last):
  ...
ValueError: cannot pairwise multiply vectors of different sizes

Objects of this type support the buffer protocol, and can be viewed as a numpy.ndarray of one dimension using the numpy.asarray function, and can be passed without copy to most numpy functions:

>>> v = VectorF([1.0, 2.0, 3.0])
>>> numpy.asarray(v)
array([1., 2., 3.], dtype=float32)
>>> numpy.log2(v)
array([0.       , 1.       , 1.5849625], dtype=float32)

Added in version 0.4.0.

__init__()¶: Create a new float vector from the given data.

argmax()¶: None

argmin()¶: None

copy()¶: None

entropy()¶

Compute the Shannon entropy of the vector.

The Shannon entropy of a probability vector is defined as:

\[H = \sum_{i=0}^{N}{\log_2 p_i}\]

Example

>>> easel.VectorF([0.1, 0.1, 0.3, 0.5]).entropy()
1.6854...
>>> easel.VectorF([0.25, 0.25, 0.25, 0.25]).entropy()
2.0

References

Cover, Thomas M., and Thomas, Joy A. Entropy, Relative Entropy, and Mutual Information. In Elements of Information Theory, 13–55. Wiley (2005): 2. doi:10.1002/047174882X.ch2 ISBN:9780471241959.

Added in version 0.4.10.

max()¶: None

min()¶: None

normalize()¶: Normalize a vector so that all elements sum to 1.

Caution

If sum is zero, sets all elements to \(\frac{1}{n}\), where \(n\) is the size of the vector.

relative_entropy(other)¶

Compute the relative entropy between two probability vectors.

The Shannon relative entropy of two probability vectors \(p\) and \(q\), also known as the Kullback-Leibler divergence, is defined as:

\[D(p \parallel q) = \sum_i p_i \log_2 \frac{p_i}{q_i}.\]

with \(D(p \parallel q) = \infty\) per definition if \(q_i = 0\) and \(p_i > 0\) for any \(i\).

Example

>>> v1 = easel.VectorF([0.1, 0.1, 0.3, 0.5])
>>> v2 = easel.VectorF([0.25, 0.25, 0.25, 0.25])
>>> v1.relative_entropy(v2)
0.3145...
>>> v2.relative_entropy(v1)   # this is no symmetric relation
0.3452...

References

Cover, Thomas M., and Thomas, Joy A. Entropy, Relative Entropy, and Mutual Information. In Elements of Information Theory, 13–55. Wiley (2005): 2. doi:10.1002/047174882X.ch2 ISBN:9780471241959.

Added in version 0.4.10.

reverse()¶: None

sum()¶

Returns the scalar sum of all elements in the vector.

Float summations use Kahan’s algorithm, in order to minimize roundoff error accumulation. Additionally, they are most accurate if the vector is sorted in increasing order, from small to large, so you may consider sorting the vector before summing it.

References

Kahan, W. Pracniques: Further Remarks on Reducing Truncation Errors. Communications of the ACM 8, no. 1 (1 January 1965): 40. doi:10.1145/363707.363723.

format¶

itemsize¶

VectorU8¶

class pyhmmer.easel.VectorU8¶

A vector storing byte-sized unsigned integers.

Added in version 0.4.0.

__init__()¶: Create a new byte vector from the given data.

argmax()¶: None

argmin()¶: None

copy()¶: None

max()¶: None

min()¶: None

reverse()¶: None

sum()¶

Returns the scalar sum of all elements in the vector.

Caution

The sum is wrapping:

>>> vec = VectorU8([255, 2])
>>> vec.sum()
1

format¶

itemsize¶

Matrix¶

class pyhmmer.easel.Matrix¶

An abstract 2D array of fixed size.

Added in version 0.4.0.

__copy__()¶: None

__reduce_ex__(protocol)¶: None

__sizeof__()¶: None

classmethod _from_raw_bytes(buffer, m, n, byteorder)¶: Create a new matrix using the given bytes to fill its contents.

argmax()¶

Return the coordinates of the maximum element in the matrix.

Raises:: ValueError – When called on an empty matrix.

argmin()¶

Return the coordinates of the minimum element in the matrix.

Raises:: ValueError – When called on an empty matrix.

copy()¶: Create a copy of the matrix, allocating a new buffer.

max()¶

Return the value of the maximum element in the matrix.

Raises:: ValueError – When called on an empty matrix.

min()¶

Return the value of the minimum element in the matrix.

Raises:: ValueError – When called on an empty matrix.

sum()¶: Return the sum of all elements in the matrix.

classmethod zeros(m, n)¶: Create a new \(m \times n\) matrix filled with zeros.

format¶

The format of each item in the matrix.

See also

The array module of the Python standard library for a detail about available type codes.

Added in version 0.4.7.

Type:: str

itemsize¶

The size of each item in the matrix, in bytes.

Added in version 0.4.7.

Type:: int

shape¶

The shape of the matrix.

Example

>>> m = MatrixF([ [1.0, 2.0], [3.0, 4.0], [5.0, 6.0] ])
>>> m.shape
(3, 2)

Type:: tuple

strides¶

The strides of the matrix.

Type:: tuple

MatrixF¶

class pyhmmer.easel.MatrixF¶

A matrix storing single-precision floating point numbers.

Use indexing notation to access and edit individual elements of the matrix:

>>> m = MatrixF.zeros(2, 2)
>>> m[0, 0] = 3.0
>>> m
MatrixF([[3.0, 0.0], [0.0, 0.0]])

Indexing can also be performed at the row-level to get a VectorF without copying the underlying data:

>>> m = MatrixF([ [1.0, 2.0], [3.0, 4.0] ])
>>> m[0]
VectorF([1.0, 2.0])

Objects of this type support the buffer protocol, and can be viewed as a numpy.ndarray with two dimensions using the numpy.asarray function, and can be passed without copy to most numpy functions:

>>> m = MatrixF([ [1.0, 2.0], [3.0, 4.0] ])
>>> numpy.asarray(m)
array([[1., 2.],
       [3., 4.]], dtype=float32)
>>> numpy.log2(m)
array([[0.       , 1.       ],
       [1.5849625, 2.       ]], dtype=float32)

Added in version 0.4.0.

__init__(*args, **kwargs)¶

argmax()¶: None

argmin()¶: None

copy()¶: None

max()¶: None

min()¶: None

sum()¶: None

format¶

itemsize¶

MatrixU8¶

class pyhmmer.easel.MatrixU8¶

A matrix storing byte-sized unsigned integers.

Added in version 0.4.0.

__init__(*args, **kwargs)¶

argmax()¶: None

argmin()¶: None

copy()¶: None

max()¶: None

min()¶: None

sum()¶: None

format¶

itemsize¶

Miscellaneous¶

Alphabet¶

class pyhmmer.easel.Alphabet¶

A biological alphabet, including additional marker symbols.

This type is used to share an alphabet to several objects in the easel and plan7 modules. Reference counting helps sharing the same instance everywhere, instead of reallocating memory every time an alphabet is needed.

Use the factory class methods to obtain a default Alphabet for one of the three standard biological alphabets:

>>> dna = Alphabet.dna()
>>> rna = Alphabet.rna()
>>> aa  = Alphabet.amino()

__reduce__()¶: None

__sizeof__()¶: None

classmethod amino()¶: Create a default amino-acid alphabet.

decode(sequence)¶

Decode a raw digital sequence into its textual representation.

Parameters:: sequence (object, buffer-like) – A raw sequence in digital format. Any object implementing the buffer protocol (like bytearray, VectorU8, etc.) may be given.
Returns:: str – A raw sequence in textual format.

Example

>>> alphabet = easel.Alphabet.amino()
>>> dseq = easel.VectorU8([0, 4, 2, 17, 3, 13, 0, 0, 5])
>>> alphabet.decode(dseq)
'AFDVEQAAG'

Added in version 0.6.3.

classmethod dna()¶: Create a default DNA alphabet.

encode(sequence)¶

Encode a raw text sequence into its digital representation.

Parameters:: sequence (str) – A raw sequence in text format.
Returns:: VectorU8 – A raw sequence in digital format.

Example

>>> alphabet = easel.Alphabet.dna()
>>> alphabet.encode("ACGT")
VectorU8([0, 1, 2, 3])

Added in version 0.6.3.

is_amino()¶: Check whether the Alphabet object is a protein alphabet.

is_dna()¶: Check whether the Alphabet object is a DNA alphabet.

is_nucleotide()¶: Check whether the Alphabet object is a nucleotide alphabet.

is_rna()¶: Check whether the Alphabet object is a RNA alphabet.

classmethod rna()¶: Create a default RNA alphabet.

K¶

The alphabet size, counting only actual alphabet symbols.

Example

>>> Alphabet.dna().K
4
>>> Alphabet.amino().K
20

Type:: int

Kp¶

The complete alphabet size, including marker symbols.

Example

>>> Alphabet.dna().Kp
18
>>> Alphabet.amino().Kp
29

Type:: int

symbols¶

The symbols composing the alphabet.

Example

>>> Alphabet.dna().symbols
'ACGT-RYMKSWHBVDN*~'
>>> Alphabet.rna().symbols
'ACGU-RYMKSWHBVDN*~'

Type:: str

type¶

The alphabet type, as a short string.

Example

>>> Alphabet.dna().type
'DNA'
>>> Alphabet.amino().type
'amino'

Added in version 0.8.2.

Type:: str

GeneticCode¶

class pyhmmer.easel.GeneticCode¶

A genetic code table for translation.

Added in version 0.7.2.

__reduce__()¶: None

translate(sequence)¶

Translate a raw nucleotide sequence into a protein.

Parameters:: sequence (object, buffer-like) – A raw sequence in digital format. Any object implementing the buffer protocol (like bytearray, VectorU8, etc.) may be given.
Returns:: pyhmmer.easel.VectorU8 – The translation of the input sequence, as a raw digital sequence.
Raises:: ValueError – When sequence could not be translated properly, because of a codon could not be recognized, or because the sequence has an invalid length.

Note

description¶

A description of the translation table currently in use.

Type:: str

translation_table¶

The translation table in use.

Can be set manually to a different number to change the translation table for the current GeneticCode object.

Type:: int

Randomness¶

class pyhmmer.easel.Randomness¶

A portable, thread-safe random number generator.

Methods with an implementation in Easel are named after the equivalent methods of random.Random.

Added in version 0.4.2.

__copy__()¶: None

__getstate__()¶: None

__init__()¶

Create a new random number generator with the given seed.

Parameters:

seed (int) – The seed to initialize the generator with. If 0 or None is given, an arbitrary seed will be chosen using the system clock.
fast (bool) – If True, use a linear congruential generator (LCG), which is low quality and should only be used for integration with legacy code. With False, use the Mersenne Twister MT19937 algorithm instead.

__reduce__()¶: None

__setstate__(state)¶: None

__sizeof__()¶: None

copy()¶: Return a copy of the random number generator in the same exact state.

getstate()¶: Get a tuple containing the current state.

normalvariate(mu, sigma)¶

Generate a Gaussian-distributed sample.

Parameters:

mu (float) – The mean of the Gaussian being sampled.
sigma (float) – The standard deviation of the Gaussian being sampled.

random()¶: Generate a uniform random deviate on \(\left[ 0, 1 \right)\).

seed(n=None)¶

Reinitialize the random number generator with the given seed.

Parameters:: n (int, optional) – The seed to use. If 0 or None, an arbitrary seed will be chosen using the current time.

setstate(state)¶: Restores the state of the random number generator.

fast¶

True when the linear congruential generator is in use.

Type:: bool

Sequence / Subsequence Index¶

class pyhmmer.easel.SSIReader¶

A read-only handler for sequence/subsequence index file.

class Entry(fd, record_offset, data_offset, record_length)¶

data_offset¶: Alias for field number 2

fd¶: Alias for field number 0

record_length¶: Alias for field number 3

record_offset¶: Alias for field number 1

class FileInfo(name, format)¶

format¶: Alias for field number 1

name¶: Alias for field number 0

__enter__()¶: None

__exit__(exc_value, exc_type, traceback)¶: None

__init__()¶

Create a new SSI file reader for the file at the given location.

Parameters:: file (str, bytes or os.PathLike) – The path to a sequence/subsequence index file to read.

close()¶: Close the SSI file reader.

file_info(fd)¶: Retrieve the FileInfo of the descriptor.

find_name(key)¶: Retrieve the Entry for the given name.

class pyhmmer.easel.SSIWriter¶

A writer for sequence/subsequence index files.

__enter__()¶: None

__exit__(exc_type, exc_value, traceback)¶: None

__init__()¶

Create a new SSI file write for the file at the given location.

Parameters:

file (str, bytes or os.PathLike) – The path to a sequence/subsequence index file to write.
exclusive (bool) – Whether or not to create a file if one does not exist.

Raises:

FileNotFoundError – When the path to the file cannot be resolved.
FileExistsError – When the file exists and exclusive is True.

add_alias(alias, key)¶: Make alias an alias of key in the index.

add_file(filename, format=0)¶

Add a new file to the index.

Parameters:

filename (str, bytes or os.PathLike) – The name of the file to register.
format (int) – A format code to associate with the file, or 0 by default.

Returns:

int – The filehandle associated with the new indexed file.

add_key(key, fd, record_offset, data_offset=0, record_length=0)¶: Add a new entry to the index with the given key.

close()¶: Close the SSI file writer.