MSAFile#

class pyhmmer.easel.SequenceFile#

A wrapper around a sequence file, containing unaligned sequences.

This class supports reading sequences stored in different formats, such as FASTA, GenBank or EMBL. The format of each file can be automatically detected, but it is also possible to pass an explicit format specifier when the SequenceFile is instantiated.

Hint

SequenceFile objects can also be used to parse files containing multiple sequence alignments: in that case, the sequences will be read sequentially, removing the gap characters:

>>> with SequenceFile("tests/data/msa/LuxC.sto") as sf:
...     sequences = sf.read_block()
>>> print(sequences[0].name[:6], sequences[0].sequence[:30])
Q9KV99 LANQPLEAILGLINEARKSWSSTPELDPYR
>>> print(sequences[1].name[:6], sequences[1].sequence[:30])
Q2WLE3 IYSYPSEAMIEIINEYSKILCSDRKFLSYE

Added in version 0.2.0: The alphabet attribute.

Changed in version 0.4.8: Support reading sequences from a file-like handle. Support reading individual sequences from an MSA file.

__init__(file, format=None, *, digital=False, alphabet=None)#

Create a new sequence file parser wrapping the given file.

Parameters:

file (str or file-like object) – Either the path to a file containing the sequences to read, or a file-like object opened in binary mode.
format (str, optional) – The format of the file, or None to autodetect. Supported values are: fasta, embl, genbank, ddbj, uniprot, ncbi, daemon, hmmpgmd, fmindex, plus any format also supported by MSAFile.
digital (bool) – Whether to read the sequences in text or digital mode. This will affect the type of Sequence objects returned later by the read method.
alphabet (Alphabet) – The alphabet to use to digitize the sequences while reading. If None is given, it will be guessed based on the contents of the first sequence.

Raises:

ValueError – When format is not a valid sequence format.
OSError – If an internal parser error occurred while guessing the alphabet or the format.

Caution

SequenceFile can generally read sequences from binary-mode file-like objects, except for sequences in an NCBI BLAST database, since it is composed of multiple files. Reading from an NCBI BLAST database passed from a filename is however supported.

Changed in version 0.4.4: Added the ignore_gaps parameter.

Changed in version 0.4.8: Support reading from a file-like object (except NCBI format).

Changed in version 0.5.0: Added the digital and alphabet keyword arguments.

Deprecated since version 0.6.0: The ignore_gaps keyword argument, use afa format instead.

Changed in version 0.8.0: Removed the ignore_gaps keyword argument.

close()#

Close the file and free the resources used by the parser.

guess_alphabet()#

Guess the alphabet of an open SequenceFile.

This method tries to guess the alphabet of a sequence file by inspecting the first sequence in the file. It returns the alphabet, or None if the file alphabet cannot be reliably guessed.

Raises:

EOFError – if the file is empty.
OSError – if a parse error occurred.
ValueError – if this method is called on a closed file.

Example

>>> with SequenceFile("tests/data/seqs/bmyD.fna") as sf:
...     sf.guess_alphabet()
DNA()
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     sf.guess_alphabet()
AA()

Added in version 0.6.3.

classmethod parse(buffer, format, *, alphabet=None)#

Parse a sequence from a binary buffer using the given format.

Parameters:

buffer (str, bytes or bytes-like buffer) – A buffer containing the sequence data to parse. In addition to str, any type implementing the buffer protocol (such as bytes, bytearray, or memoryview) is supported and interpreted as containing an ASCII string.
format (str) – The format of the sequence data. See the SequenceFile.__init__ documentation for allowed values.

Keyword Arguments:

alphabet (Alphabet) – The alphabet to use to digitize the returned sequence, if desired.

Returns:

Sequence – The sequence parsed from the buffer, either as a DigitalSequence if an alphabet was provided, or as a TextSequence if None was given.

Raises:

ValueError – When format is not a valid sequence format.
OSError – If an internal parser error occurred while guessing the alphabet or the format.

classmethod parseinto(seq, buffer, format)#

Parse a sequence from a binary buffer into seq.

Parameters:

seq (Sequence) – The sequence object into which the deserialized sequence data will be written.
buffer (str, bytes or bytes-like buffer) – A buffer containing the sequence data to parse. In addition to str, any type implementing the buffer protocol (such as bytes, bytearray, or memoryview) is supported and interpreted as containing an ASCII string.
format (str) – The format of the sequence data. See the SequenceFile.__init__ documentation for allowed values.

Raises:

ValueError – When format is not a valid sequence format.
OSError – If an internal parser error occurred while guessing the alphabet or the format.

Returns:

Sequence – The sequence given as argument, or None if the end of the file was reached.

read(skip_info=False, skip_sequence=False)#

Read the next sequence from the file.

Parameters:

skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.
skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.

Returns:

Sequence – The next sequence in the file, or None if all sequences were read from the file.

Raises:

ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.

Hint

This method allocates a new sequence on every call, which is inefficient when sequences are read in a tight loop. Where possible, use SequenceFile.readinto with an already initialized Sequence to recycle its internal buffers.

read_block(sequences=None, residues=None)#

Read several sequences into a sequence block.

Parameters:

sequences (int, optional) – The maximum number of sequences to read before returning a block. Leave as None to read all remaining sequences from the file.
residues (int, optional) – The number of residues to read before returning the block. Leave as None to keep reading sequences without a residue limit.

Returns:

SequenceBlock – A sequence block object, which may be empty if there are no sequences to read anymore. The concrete type depends on whether the SequenceFile was opened in text or digital mode.

Raises:

ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.

Example

Read a block of at most 4 sequences from a sequence file:

>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     block = sf.read_block(sequences=4)
>>> len(block)
4

Read sequences until the block contains at least 1000 residues:

>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     block = sf.read_block(residues=1000)
>>> len(block)
3
>>> len(block[0]) + len(block[1]) + len(block[2])
1444

Note that the last sequence will not be truncated, so the block will always contain at least the number of residues given in the residues argument, unless the end of the file was reached.

Added in version 0.7.0.

readinto(seq, skip_info=False, skip_sequence=False)#

Read the next sequence from the file, using seq to store data.

Parameters:

seq (Sequence) – A sequence object to use to store the next entry in the file. If this sequence was used before, it must be properly reset (using the Sequence.clear method) before using it again with readinto.
skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.
skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.

Returns:

Sequence – A reference to seq that was passed as an input, or None if no sequences are left in the file.

Raises:

ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.

Example

Use SequenceFile.readinto to loop over the sequences in a file while recycling the same Sequence buffer:

>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     seq = TextSequence()
...     while sf.readinto(seq) is not None:
...         # ... process seq here ... #
...         seq.clear()

rewind()#

Rewind the file back to the beginning.

For sequential formats, this method is supported for both path-based and file object-based sequence files. For multiple-sequence alignment formats, the underlying MSAFile needs to be reopened, so this is only supported for path-based files.

Raises:

io.UnsupportedOperation – When attempting to rewind a sequence file where the underlying stream is a file-like object that does not support the seek method.

closed#

Whether the SequenceFile is closed or not.

Type: bool

digital#

Whether the SequenceFile is in digital mode or not.

Added in version 0.5.0.

Type: bool

format#

The format of the SequenceFile.

Type: str

indexed#

The indexed sequences.

This property can be used to access the sequences in a file by name, provided they are indexed in an SSI file (exposed as an SSIReader in the SequenceFile.index attribute).

If the sequence file has no associated index (because it was opened from a file-like object, or because the associated index could not be opened successfully), this attribute is None.

Example

>>> file = SequenceFile("tests/data/seqs/938293.PRJEB85.HG003687.faa")
>>> file.index
<pyhmmer.easel.SSIReader object at 0x...>
>>> seq = file.indexed['938293.PRJEB85.HG003684_29']
>>> print(seq.name, seq.sequence[:30])
938293.PRJEB85.HG003684_29 MESINKKIEDIMTKNTGKIFSINDFYGLGT
>>> file.close()

Added in version 0.12.0.

Type: Mapping or None

class pyhmmer.easel.MSAFile#

A wrapper around a multiple-alignment file.

This class supports reading alignments stored in different formats, such as Stockholm, A2M, PSI-BLAST or Clustal.

name#

The name of the MSA file, if it was created from a filename, or None if it wraps a file-like object.

Type: str, optional

Hint

Some Clustal files created by alignment tools other than Clustal (such as MUSCLE or MAFFT, for instance), may not contain the header expected by Easel for the Clustal format. If you get an error while trying to parse these files, use the "clustallike" format instead of the "clustal" format when creating the MSAFile.

__init__(file, format=None, *, digital=False, alphabet=None)#

Create a new MSA file parser wrapping the given file.

Parameters:

file (str or file-like object) – Either the path to a file containing the sequences to read, or a file-like object opened in binary mode.
format (str, optional) – The format of the file, or None to autodetect. Supported values are: stockholm, pfam, a2m, psiblast, selex, afa (aligned FASTA), clustal, clustallike, phylip, phylips.

Keyword Arguments:

digital (bool) – Whether to read the sequences in text or digital mode. This will affect the type of MSA objects returned later by the read method.
alphabet (Alphabet) – The alphabet to use to digitize the sequences while reading. If None is given, it will be guessed based on the contents of the first sequence.

Raises:

ValueError – When format is not a valid MSA format.

Changed in version 0.4.8: Support reading from a file-like object.

Changed in version 0.5.0: Added the digital and alphabet keyword arguments.

close()#

Close the file and free the resources used by the parser.

guess_alphabet()#

Guess the alphabet of an open MSAFile.

This method tries to guess the alphabet of a multiple-alignment file by inspecting the first entry in the file. It returns the alphabet, or None if the file alphabet cannot be reliably guessed.

Raises:

EOFError – if the file is empty.
OSError – if a parse error occurred.
ValueError – if this method is called after the file was closed.

Example

>>> with MSAFile("tests/data/msa/LuxC.sto") as mf:
...     mf.guess_alphabet()
AA()

classmethod parse(buffer, format=None, *, alphabet=None)#

Parse an MSA from a binary buffer using the given format.

Parameters:

buffer (str, bytes or bytes-like buffer) – A buffer containing the alignment data to parse. In addition to str, any type implementing the buffer protocol (such as bytes, bytearray, or memoryview) is supported and interpreted as containing an ASCII string.
format (str, optional) – The format of the alignment data. See the MSAFile.__init__ documentation for allowed values. If None is given, the format will be autodetected.

Keyword Arguments:

alphabet (Alphabet) – The alphabet to use to digitize the returned sequence, if desired.

Returns:

MSA – The MSA parsed from the buffer, either as a DigitalMSA if an alphabet was provided, or as a TextMSA if None was given.

Raises:

ValueError – When format is not a valid MSA format.
OSError – If an internal parser error occurred while guessing the alphabet or the format.

read()#

Read the next alignment from the file.

Returns:

MSA – The next alignment in the file, or None if all the alignments were read from the file already.

Raises:

ValueError – When attempting to read an alignment from a closed file, or when the file could not be parsed.

closed#

Whether the MSAFile is closed or not.

Type: bool

digital#

Whether the MSAFile is in digital mode or not.

Type: bool

format#

The format of the MSAFile.

Type: str

indexed#

The indexed alignments.

This property can be used to access the alignments in a file by name, provided they are indexed in an SSI file (exposed as an SSIReader in the MSAFile.index attribute).

If the alignment file has no associated index (because it was opened from a file-like object, or because the associated index could not be opened successfully), this attribute is None.

Added in version 0.12.0.

Type: Mapping or None