Easel

High-level interface to the Easel C library.

Easel is a library developed by the Eddy/Rivas Lab to facilitate the development of biological software in C. It is used by HMMER and Infernal.

Alphabet

class pyhmmer.easel.Alphabet

A biological alphabet, including additional marker symbols.

This type is used to share an alphabet between several objects in the easel and plan7 modules. Reference counting allows the same instance to be shared everywhere, instead of reallocating memory every time an alphabet is needed.

Use the factory class methods to obtain a default Alphabet for one of the three standard biological alphabets:

>>> dna = Alphabet.dna()
>>> rna = Alphabet.rna()
>>> aa  = Alphabet.amino()

amino()

Create a default amino-acid alphabet.

dna()

Create a default DNA alphabet.

rna()

Create a default RNA alphabet.

K

The alphabet size, counting only actual alphabet symbols.

Example

>>> Alphabet.dna().K
4
>>> Alphabet.amino().K
20

Type

int

Kp

The complete alphabet size, including marker symbols.

Example

>>> Alphabet.dna().Kp
18
>>> Alphabet.amino().Kp
29

Type

int

symbols

The symbols composing the alphabet.

Example

>>> Alphabet.dna().symbols
'ACGT-RYMKSWHBVDN*~'
>>> Alphabet.rna().symbols
'ACGU-RYMKSWHBVDN*~'

Type

str
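
The relationship between K, Kp and symbols can be illustrated without pyhmmer. The sketch below hard-codes the DNA symbol string shown above and assumes that the digital code of a symbol is simply its position in that string. The helper name `digitize` is hypothetical and is not the pyhmmer method of the same name.

```python
from typing import List

# Symbol string copied from the DNA example above.
DNA_SYMBOLS = "ACGT-RYMKSWHBVDN*~"
K = 4                      # actual alphabet symbols only
Kp = len(DNA_SYMBOLS)      # actual symbols plus marker symbols

canonical = DNA_SYMBOLS[:K]    # "ACGT"
markers = DNA_SYMBOLS[K:]      # gap, degenerate codes, '*' and '~'


def digitize(text: str, symbols: str = DNA_SYMBOLS) -> List[int]:
    """Map each letter to its index in `symbols` (assumption for illustration)."""
    return [symbols.index(letter) for letter in text.upper()]


print(Kp, canonical, digitize("ATGC"))
```

Under that assumption, the first K codes always denote actual residues, and every code from K up to Kp - 1 denotes a marker symbol.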

Bitfield

class pyhmmer.easel.Bitfield

A statically sized sequence of booleans stored as a packed bitfield.

A bitfield is instantiated with a fixed length, and all booleans are set to False by default:

>>> bitfield = Bitfield(8)
>>> len(bitfield)
8
>>> bitfield[0]
False

Use indexing to access and edit individual bits:

>>> bitfield[0] = True
>>> bitfield[0]
True
>>> bitfield[0] = False
>>> bitfield[0]
False

__init__(length)

Create a new bitfield with the given length.

count(value=True)

Count the number of occurrences of value in the bitfield.

If no argument is given, counts the number of True occurrences.

Example

>>> bitfield = Bitfield(8)
>>> bitfield.count(False)
8
>>> bitfield[0] = bitfield[1] = True
>>> bitfield.count()
2

toggle(index)

Switch the value of one single bit.

Example

>>> bitfield = Bitfield(8)
>>> bitfield[0]
False
>>> bitfield.toggle(0)
>>> bitfield[0]
True
>>> bitfield.toggle(0)
>>> bitfield[0]
False
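
To make "packed bitfield" concrete, here is a minimal pure-Python sketch that stores eight booleans per byte and supports the same operations as the examples above. The class name `PackedBits` is hypothetical, and the storage layout is an assumption for illustration. Easel may use a different word size internally.

```python
class PackedBits:
    """Fixed-length sequence of booleans packed eight to a byte (illustrative)."""

    def __init__(self, length: int) -> None:
        self._length = length
        self._data = bytearray((length + 7) // 8)  # every bit starts at 0 (False)

    def __len__(self) -> int:
        return self._length

    def _locate(self, index: int):
        if not 0 <= index < self._length:
            raise IndexError(index)
        return divmod(index, 8)  # (byte offset, bit offset)

    def __getitem__(self, index: int) -> bool:
        byte, bit = self._locate(index)
        return bool((self._data[byte] >> bit) & 1)

    def __setitem__(self, index: int, value: bool) -> None:
        byte, bit = self._locate(index)
        if value:
            self._data[byte] |= 1 << bit
        else:
            self._data[byte] &= ~(1 << bit) & 0xFF

    def toggle(self, index: int) -> None:
        byte, bit = self._locate(index)
        self._data[byte] ^= 1 << bit  # XOR flips exactly one bit

    def count(self, value: bool = True) -> int:
        ones = sum(bin(b).count("1") for b in self._data)
        return ones if value else self._length - ones


bits = PackedBits(10)
bits[0] = bits[9] = True
bits.toggle(0)
print(len(bits), bits[0], bits[9], bits.count(), bits.count(False))
```

Ten booleans fit in two bytes here, whereas a Python list would hold ten object references.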

KeyHash

class pyhmmer.easel.KeyHash

A dynamically resized container to store byte keys using a hash table.

Internally uses Bob Jenkins’ one-at-a-time hash, a simple and efficient hash function published in 1997 that exhibits avalanche behaviour.
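
For reference, the one-at-a-time function itself is short enough to sketch in pure Python. This follows the published algorithm with 32-bit arithmetic. How Easel reduces the hash value to a table slot is not described here, so that step is left out.

```python
def one_at_a_time(key: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash, using 32-bit unsigned arithmetic."""
    mask = 0xFFFFFFFF
    h = 0
    for byte in key:
        # Mix each input byte into the running state.
        h = (h + byte) & mask
        h = (h + (h << 10)) & mask
        h ^= h >> 6
    # Final avalanche so that every input bit can affect every output bit.
    h = (h + (h << 3)) & mask
    h ^= h >> 11
    h = (h + (h << 15)) & mask
    return h


print(hex(one_at_a_time(b"a")))  # 0xca2e9442
```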

Example

Add new keys to the key hash using the add method like you would with a Python set:

>>> kh = KeyHash()
>>> kh.add(b"key")
0

Check if a key hash contains a given key:

>>> b"key" in kh
True
>>> b"missing" in kh
False

Get the index associated with a key using the indexing notation:

>>> kh[b"key"]
0
>>> kh[b"missing"]
Traceback (most recent call last):
  ...
KeyError: b'missing'

See also

The Wikipedia article for Bob Jenkins’ hash functions: https://en.wikipedia.org/wiki/Jenkins_hash_function

__init__()

Create a new empty key-hash collection.

add(item)

Add a new key to the hash table, and return its index.

If key was already in the hash table, the previous index is returned:

>>> kh = KeyHash()
>>> kh.add(b"first")
0
>>> kh.add(b"second")
1
>>> kh.add(b"first")
0

Parameters

key (bytes) – The key to add to the hash table.

Returns

int – The index corresponding to the added key.

New in version 0.3.0.
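
The indexing behaviour of add can be mimicked with a plain dictionary. The sketch below uses a hypothetical `KeyIndex` class to show the contract only (consecutive indices, stable on re-insertion) and says nothing about how KeyHash is implemented.

```python
class KeyIndex:
    """Assign consecutive integer indices to distinct byte keys (illustrative)."""

    def __init__(self) -> None:
        self._index = {}

    def add(self, key: bytes) -> int:
        # setdefault stores the new index only when the key is absent,
        # so an existing key keeps the index it was first given.
        return self._index.setdefault(key, len(self._index))

    def __contains__(self, key: bytes) -> bool:
        return key in self._index

    def __getitem__(self, key: bytes) -> int:
        return self._index[key]  # raises KeyError for unknown keys


ki = KeyIndex()
indices = [ki.add(b"first"), ki.add(b"second"), ki.add(b"first")]
print(indices)
```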

clear()

Remove all entries from the collection.

copy()

Create and return an exact copy of this mapping.

Example

>>> kh = KeyHash()
>>> kh.add(b"key")
0
>>> copy = kh.copy()
>>> b"key" in copy
True

Multiple Sequence Alignment

class pyhmmer.easel.MSA

An abstract alignment of multiple sequences.

Hint

Use len(msa) to get the number of columns in the alignment, and len(msa.sequences) to get the number of sequences (i.e. the number of rows).

checksum()

Calculate a 32-bit checksum for the multiple sequence alignment.

write(fh, format)

Write the multiple sequence alignment to a file handle.

Parameters
  • fh (io.IOBase) – A Python file handle, opened in binary mode.

  • format (str) – The name of the multiple sequence alignment file format to use.

New in version 0.3.0.

accession

The accession of the alignment, if any.

Type

bytes or None

author

The author of the alignment, if any.

Type

bytes or None

description

The description of the alignment, if any.

Type

bytes or None

name

The name of the alignment, if any.

Type

bytes or None

class pyhmmer.easel.TextMSA(MSA)

A multiple sequence alignment stored in text mode.

__init__(name=None, description=None, accession=None, sequences=None, author=None)

Create a new text-mode alignment with the given sequences.

Parameters
  • name (bytes, optional) – The name of the alignment, if any.

  • description (bytes, optional) – The description of the alignment, if any.

  • accession (bytes, optional) – The accession of the alignment, if any.

  • sequences (iterable of TextSequence) – The sequences to store in the multiple sequence alignment. All sequences must have the same length. They also need to have distinct names.

  • author (bytes, optional) – The author of the alignment, often used to record the aligner it was created with.

Raises

Example

>>> s1 = TextSequence(name=b"seq1", sequence="ATGC")
>>> s2 = TextSequence(name=b"seq2", sequence="ATGC")
>>> msa = TextMSA(name=b"msa", sequences=[s1, s2])
>>> len(msa)
4

Changed in version 0.3.0: Allow creating an alignment from an iterable of TextSequence.

copy()

Duplicate the text sequence alignment, and return the copy.

digitize(alphabet)

Convert the text alignment to a digital alignment using alphabet.

Returns

DigitalMSA – An alignment in digital mode containing the same sequences digitized with alphabet.

sequences

A view of the sequences in the alignment.

This property lets you access the individual sequences in the multiple sequence alignment as TextSequence instances.

Example

Query the number of sequences in the alignment with len, or access individual members via indexing notation:

>>> s1 = TextSequence(name=b"seq1", sequence="ATGC")
>>> s2 = TextSequence(name=b"seq2", sequence="ATGC")
>>> msa = TextMSA(name=b"msa", sequences=[s1, s2])
>>> len(msa.sequences)
2
>>> msa.sequences[0].name
b'seq1'

Caution

Sequences in the list are copies, so editing their attributes will have no effect on the alignment:

>>> msa.sequences[0].name
b'seq1'
>>> msa.sequences[0].name = b"seq1bis"
>>> msa.sequences[0].name
b'seq1'

Support for this feature will be added in a future version. For now, it can be worked around by explicitly assigning the updated object back to its position:

>>> seq = msa.sequences[0]
>>> seq.name = b"seq1bis"
>>> msa.sequences[0] = seq
>>> msa.sequences[0].name
b'seq1bis'
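
The copy-on-read behaviour described in the caution can be reproduced with a small stand-alone model. The `Row` and `RowsView` names below are hypothetical. The sketch only illustrates why an edit on `view[i]` is lost while an explicit write-back sticks.

```python
import copy
from dataclasses import dataclass


@dataclass
class Row:
    name: bytes
    sequence: str


class RowsView:
    """Hand out copies on read; replace the stored row on assignment."""

    def __init__(self, rows):
        self._rows = list(rows)

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        return copy.copy(self._rows[index])  # the caller never sees the stored object

    def __setitem__(self, index, row):
        self._rows[index] = copy.copy(row)


rows = RowsView([Row(b"seq1", "ATGC"), Row(b"seq2", "ATGC")])

rows[0].name = b"seq1bis"      # edits a temporary copy, which is then discarded
unchanged = rows[0].name

row = rows[0]
row.name = b"seq1bis"
rows[0] = row                  # explicit write-back replaces the stored row
updated = rows[0].name

print(unchanged, updated)
```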

New in version 0.3.0.

Type

_TextMSASequences

class pyhmmer.easel.DigitalMSA(MSA)

A multiple sequence alignment stored in digital mode.

alphabet

The biological alphabet used to encode this sequence alignment to digits.

Type

Alphabet

__init__(alphabet, name=None, description=None, accession=None, sequences=None, author=None)

Create a new digital-mode alignment with the given sequences.

Parameters
  • alphabet (Alphabet) – The alphabet of the aligned sequences.

  • name (bytes, optional) – The name of the alignment, if any.

  • description (bytes, optional) – The description of the alignment, if any.

  • accession (bytes, optional) – The accession of the alignment, if any.

  • sequences (iterable of DigitalSequence) – The sequences to store in the multiple sequence alignment. All sequences must have the same length and alphabet. They also need to have distinct names set.

  • author (bytes, optional) – The author of the alignment, often used to record the aligner it was created with.

Changed in version 0.3.0: Allow creating an alignment from an iterable of DigitalSequence.

copy()

Duplicate the digital sequence alignment, and return the copy.

textize()

Convert the digital alignment to a text alignment.

Returns

TextMSA – A copy of the alignment in text-mode.

New in version 0.3.0.

sequences

A view of the sequences in the alignment.

This property lets you access the individual sequences in the multiple sequence alignment as DigitalSequence instances.

See also

The documentation for the TextMSA.sequences property, which contains some additional information.

New in version 0.3.0.

Type

_DigitalMSASequences

Sequence

class pyhmmer.easel.Sequence

An abstract biological sequence with some associated metadata.

Easel provides two different modes to store a sequence: text or digital. In the HMMER code, changing from one mode to the other is done in place, which allows recycling memory. However, doing so can be confusing since there is no way to know statically the representation of a sequence.

To avoid this, pyhmmer provides two subclasses of the Sequence abstract class to maintain the mode contract: TextSequence and DigitalSequence. Functions expecting sequences in digital format, like pyhmmer.hmmsearch, can then rely on Python's type system to make sure they receive sequences in the right mode. This allows type checkers such as mypy to detect potential contract breaches statically, before the code runs.
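
The idea of encoding the storage mode in the type can be sketched with stand-in classes. The names `BaseSeq`, `TextSeq`, `DigitalSeq` and `residue_count` are hypothetical and deliberately differ from the pyhmmer classes. The point is only that an annotated parameter lets a static checker reject the wrong mode.

```python
from abc import ABC


class BaseSeq(ABC):
    """Stand-in for an abstract sequence base class."""


class TextSeq(BaseSeq):
    def __init__(self, sequence: str = "") -> None:
        self.sequence = sequence


class DigitalSeq(BaseSeq):
    def __init__(self, codes: bytes = b"") -> None:
        self.sequence = memoryview(codes)


def residue_count(seq: DigitalSeq) -> int:
    """Hypothetical function that is only meaningful for digital sequences."""
    return len(seq.sequence)


digital = DigitalSeq(bytes([0, 3, 2, 1]))
count = residue_count(digital)       # accepted by a type checker
# residue_count(TextSeq("ATGC"))     # a checker such as mypy would flag this call
print(count)
```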

checksum()

Calculate a 32-bit checksum for the sequence.

clear()

Reinitialize the sequence for re-use.

copy()

Duplicate the sequence, and return the copy.

write(fh)

Write the sequence to a file handle, in FASTA format.

Parameters

fh (io.IOBase) – A Python file handle, opened in binary mode.

New in version 0.3.0.
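
The "opened in binary mode" requirement means the handle receives bytes, not str. The stand-alone sketch below writes a FASTA record to an in-memory binary handle. The helper `write_fasta`, its header layout and the 60-column wrapping are assumptions for illustration and may not match what Easel emits.

```python
import io


def write_fasta(fh, name: bytes, sequence: str,
                description: bytes = b"", width: int = 60) -> None:
    """Write one FASTA record to a binary file handle (illustrative helper)."""
    header = b">" + name
    if description:
        header += b" " + description
    fh.write(header + b"\n")
    # Wrap the residues so that no line exceeds `width` characters.
    for start in range(0, len(sequence), width):
        fh.write(sequence[start:start + width].encode("ascii") + b"\n")


buffer = io.BytesIO()          # behaves like a file opened with mode "wb"
write_fasta(buffer, b"seq1", "ATGC", description=b"example")
print(buffer.getvalue())
```

Passing a handle opened in text mode to such a function fails on the first write, because a text handle only accepts str.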

accession

The accession of the sequence.

Type

bytes

description

The description of the sequence.

Type

bytes

name

The name of the sequence.

Type

bytes

source

The source of the sequence, if any.

Type

bytes

class pyhmmer.easel.TextSequence(Sequence)

A biological sequence stored in text mode.

Hint

Use the sequence property to access the sequence letters as a Python string.

__init__(name=None, description=None, accession=None, sequence=None, source=None)

Create a new text-mode sequence with the given attributes.

copy()

Duplicate the text sequence, and return the copy.

digitize(alphabet)

Convert the text sequence to a digital sequence using alphabet.

Returns

DigitalSequence – A copy of the sequence in digital mode, digitized with alphabet.

reverse_complement()

Build the reverse complement of the sequence.

This method assumes that the sequence alphabet is IUPAC/DNA. If the sequence contains any unknown letters, they will be replaced by N in the reverse-complement.

Parameters

inplace (bool) – Whether or not to copy the sequence before computing its reverse complement. With False (the default), the method will return a copy of the sequence that has been reverse-complemented. With True, it will reverse-complement inplace and return None.

Raises

UserWarning – When the sequence contains unknown characters.

Example

>>> seq = TextSequence(sequence="ATGC")
>>> seq.reverse_complement().sequence
'GCAT'

Caution

The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!

New in version 0.3.0.
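
The reverse-complement operation on IUPAC DNA letters can be sketched in pure Python as follows. The function is a stand-alone illustration, not the pyhmmer implementation. Uppercasing the input is a simplification made here, and unknown letters become N with a UserWarning, as described above.

```python
import warnings

# IUPAC nucleotide codes paired with their complements.
_COMPLEMENT = dict(zip("ACGTRYMKSWHBVDN", "TGCAYRKMSWDVBHN"))


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement, replacing unknown letters with N."""
    result = []
    unknown = False
    for letter in reversed(sequence.upper()):  # this sketch normalises case
        if letter not in _COMPLEMENT:
            unknown = True
        result.append(_COMPLEMENT.get(letter, "N"))
    if unknown:
        warnings.warn("sequence contains unknown characters", UserWarning)
    return "".join(result)


print(reverse_complement("ATGC"))  # GCAT
```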

sequence

The raw sequence letters, as a Python string.

Type

str

class pyhmmer.easel.DigitalSequence(Sequence)

A biological sequence stored in digital mode.

alphabet

The biological alphabet used to encode this sequence to digits.

Type

Alphabet, readonly

Hint

Use the sequence property to access the sequence digits as a memory view, which gives access to the individual bytes. This can be combined with numpy.asarray to get the sequence as an array without copying.

__init__(alphabet, name=None, description=None, accession=None, sequence=None, source=None)

Create a new digital-mode sequence with the given attributes.

New in version 0.1.4.

copy()

Duplicate the digital sequence, and return the copy.

reverse_complement()

Build the reverse complement of the sequence.

Parameters

inplace (bool) – Whether or not to copy the sequence before computing its reverse complement. With False (the default), the method will return a copy of the sequence that has been reverse-complemented. With True, it will reverse-complement inplace and return None.

Raises

Caution

The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!

New in version 0.3.0.

textize()

Convert the digital sequence to a text sequence.

Returns

TextSequence – A copy of the sequence in text-mode.

New in version 0.1.4.

sequence

The raw sequence digits, as a memory view.

Type

memoryview
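
What a memory view offers can be shown with the standard library alone. In this sketch a bytearray stands in for the buffer holding the digital codes. The code values are hypothetical and chosen to match "ATGC" under a position-based encoding.

```python
codes = bytearray([0, 3, 2, 1])    # hypothetical digital codes
view = memoryview(codes)           # no bytes are copied here

first = view[0]                    # individual codes read as plain integers
as_list = view.tolist()            # an explicit copy into a Python list

codes[0] = 1                       # mutate the underlying buffer...
shared = view[0]                   # ...and the view observes the change

print(first, as_list, shared)
```

Because the view shares memory with its buffer, converting it to an array type that supports the buffer protocol does not need to duplicate the data.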

Sequence File

class pyhmmer.easel.SequenceFile

A wrapper around a sequence file, containing unaligned sequences.

This class supports reading sequences stored in different formats, such as FASTA, GenBank or EMBL. The format of each file can be automatically detected, but it is also possible to pass an explicit format specifier when the SequenceFile is instantiated.

New in version 0.2.0: The alphabet attribute.

__init__(file, format=None)

Create a new sequence file parser wrapping the given file.

Parameters
  • file (str) – The path to a file containing sequences in one of the supported file formats.

  • format (str, optional) – The format of the file, or None to autodetect. Supported values are: fasta, embl, genbank, ddbj, uniprot, ncbi, daemon, hmmpgmd, fmindex.

close()

Close the file and free the resources used by the parser.

guess_alphabet()

Guess the alphabet of an open SequenceFile.

This method tries to guess the alphabet of a sequence file by inspecting the first sequence in the file. It returns the alphabet, or None if the file alphabet cannot be reliably guessed.

Raises
  • EOFError – if the file is empty.

  • OSError – if a parse error occurred.

  • ValueError – if this method is called after the file was closed.

parse(buffer, format)

Parse a sequence from a binary buffer using the given format.

parseinto(seq, buffer, format)

Parse a sequence from a binary buffer into seq.

read(skip_info=False, skip_sequence=False)

Read the next sequence from the file.

Parameters
  • skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.

  • skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.

Returns

Sequence – The next sequence in the file, or None if all sequences were read from the file.

Raises

ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.

Hint

This method allocates a new sequence, which is inefficient when sequences are read in a tight loop. Where possible, use SequenceFile.readinto with an already initialized Sequence to recycle the internal buffers.
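
The buffer-recycling pattern behind this hint can be illustrated with a toy FASTA reader built on the standard library. `Record` and `read_into` are hypothetical names, and the parsing is deliberately minimal. The sketch only shows one object being cleared and refilled instead of a new one being allocated per entry.

```python
import io


class Record:
    """Mutable record that can be cleared and refilled (illustrative)."""

    def __init__(self) -> None:
        self.name = ""
        self.residues = []

    def clear(self) -> None:
        self.name = ""
        self.residues.clear()   # keeps the same list object alive


def read_into(handle, record):
    """Fill `record` with the next FASTA entry, or return None at end of file."""
    header = handle.readline()
    if not header:
        return None
    record.name = header[1:].strip()
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line or line.startswith(">"):
            handle.seek(position)   # leave the next header for the next call
            break
        record.residues.extend(line.strip())
    return record


handle = io.StringIO(">a\nAT\nGC\n>b\nTTT\n")
record = Record()
seen = []
while read_into(handle, record) is not None:
    seen.append((record.name, "".join(record.residues)))
    record.clear()

print(seen)
```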

readinto(seq, skip_info=False, skip_sequence=False)

Read the next sequence from the file, using seq to store data.

Parameters
  • seq (Sequence) – A sequence object to use to store the next entry in the file. If this sequence was used before, it must be properly reset (using the Sequence.clear method) before using it again with readinto.

  • skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.

  • skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.

Returns

Sequence – A reference to seq that was passed as an input, or None if no sequences are left in the file.

Raises

ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.

Example

Use SequenceFile.readinto to loop over the sequences in a file while recycling the same Sequence buffer:

>>> with SequenceFile("vendor/hmmer/testsuite/ecori.fa") as sf:
...     seq = TextSequence()
...     while sf.readinto(seq) is not None:
...         # ... process seq here ... #
...         seq.clear()

set_digital(alphabet)

Set the SequenceFile to read in digital mode with alphabet.

This method can be called even after the first sequences have been read; it only affects subsequent sequences in the file.

Sequence / Subsequence Index

class pyhmmer.easel.SSIReader

A read-only handler for a sequence/subsequence index file.

class Entry(fd, record_offset, data_offset, record_length)

property data_offset

Alias for field number 2

property fd

Alias for field number 0

property record_length

Alias for field number 3

property record_offset

Alias for field number 1

class FileInfo(name, format)

property format

Alias for field number 1

property name

Alias for field number 0
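
The "Alias for field number N" wording is what Python generates for the fields of a named tuple, which suggests that Entry and FileInfo behave like named tuples. Under that assumption, the stand-alone sketch below shows how such records can be read by name, by position, or by unpacking. The offsets are made-up values.

```python
from collections import namedtuple

# Field order copied from the class signatures above.
Entry = namedtuple("Entry", ["fd", "record_offset", "data_offset", "record_length"])
FileInfo = namedtuple("FileInfo", ["name", "format"])

entry = Entry(fd=0, record_offset=128, data_offset=140, record_length=64)
same_field = entry.data_offset == entry[2]      # attribute and index 2 agree

# A named tuple also unpacks like a regular tuple.
fd, record_offset, data_offset, record_length = entry

info = FileInfo(name="sequences.fa", format=0)
print(entry, info, same_field)
```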

__init__(file)

Create a new SSI file reader for the file at the given location.

Parameters

file (str) – The path to a sequence/subsequence index file to read.

close()

Close the SSI file reader.

file_info(fd)

Retrieve the FileInfo of the descriptor.

find_name(key)

Retrieve the Entry for the given name.

class pyhmmer.easel.SSIWriter

A writer for sequence/subsequence index files.

__init__(file)

Create a new SSI file writer for the file at the given location.

Parameters
  • file (str) – The path to a sequence/subsequence index file to write.

  • exclusive (bool) – Whether or not to create a file if one does not exist.

Raises

add_alias(alias, key)

Make alias an alias of key in the index.

add_file(filename, format=0)

Add a new file to the index.

Parameters
  • filename (str) – The name of the file to register.

  • format (int) – A format code to associate with the file, or 0.

Returns

int – The filehandle associated with the new indexed file.

add_key(key, fd, record_offset, data_offset=0, record_length=0)

Add a new entry to the index with the given key.

close()

Close the SSI file writer.