MSAFile
- class pyhmmer.easel.SequenceFile
A wrapper around a sequence file, containing unaligned sequences.
This class supports reading sequences stored in different formats, such as FASTA, GenBank or EMBL. The format of each file can be automatically detected, but it is also possible to pass an explicit format specifier when the SequenceFile is instantiated.
Hint
SequenceFile objects can also be used to parse files containing multiple sequence alignments: in that case, the sequences will be read sequentially, removing the gap characters:
>>> with SequenceFile("tests/data/msa/LuxC.sto") as sf:
...     sequences = sf.read_block()
>>> print(sequences[0].name[:6], sequences[0].sequence[:30])
b'Q9KV99' LANQPLEAILGLINEARKSWSSTPELDPYR
>>> print(sequences[1].name[:6], sequences[1].sequence[:30])
b'Q2WLE3' IYSYPSEAMIEIINEYSKILCSDRKFLSYE
Added in version 0.2.0: The alphabet attribute.
Changed in version 0.4.8: Support reading sequences from a file-like handle. Support reading individual sequences from an MSA file.
- __init__(file, format=None, *, digital=False, alphabet=None)
Create a new sequence file parser wrapping the given file.
- Parameters:
file (str or file-like object) – Either the path to a file containing the sequences to read, or a file-like object opened in binary mode.
format (str, optional) – The format of the file, or None to autodetect. Supported values are: fasta, embl, genbank, ddbj, uniprot, ncbi, daemon, hmmpgmd, fmindex, plus any format also supported by MSAFile.
digital (bool) – Whether to read the sequences in text or digital mode. This will affect the type of Sequence objects returned later by the read function.
alphabet (Alphabet) – The alphabet to use to digitize the sequences while reading. If None is given, it will be guessed based on the contents of the first sequence.
- Raises:
ValueError – When format is not a valid sequence format.
OSError – If an internal parser error occurred while guessing the alphabet or the format.
Caution
SequenceFile can generally read sequences from binary-mode file-like objects, except for sequences in an NCBI BLAST database, since it is composed of multiple files. Reading an NCBI BLAST database given as a filename is however supported.
Changed in version 0.4.4: Added the ignore_gaps parameter.
Changed in version 0.4.8: Support reading from a file-like object (except NCBI format).
Changed in version 0.5.0: Added the digital and alphabet keyword arguments.
Deprecated since version 0.6.0: The ignore_gaps keyword argument, use afa format instead.
Changed in version 0.8.0: Removed the ignore_gaps keyword argument.
- close()
Close the file and free the resources used by the parser.
- guess_alphabet()
Guess the alphabet of an open SequenceFile.
This method tries to guess the alphabet of a sequence file by inspecting the first sequence in the file. It returns the alphabet, or None if the file alphabet cannot be reliably guessed.
- Raises:
EOFError – If the file is empty.
OSError – If a parse error occurred.
ValueError – If this method is called on a closed file.
Example
>>> with SequenceFile("tests/data/seqs/bmyD.fna") as sf:
...     sf.guess_alphabet()
Alphabet.dna()
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     sf.guess_alphabet()
Alphabet.amino()
Added in version 0.6.3.
- classmethod parse(buffer, format, *, alphabet=None)
Parse a sequence from a binary buffer using the given format.
- Parameters:
buffer (bytes or byte-like buffer) – A buffer containing the sequence data to parse. Any type implementing the buffer protocol (such as bytes, bytearray, or memoryview) is supported.
format (str) – The format of the sequence data. See the SequenceFile.__init__ documentation for allowed values.
- Keyword Arguments:
alphabet (Alphabet) – The alphabet to use to digitize the returned sequence, if desired.
- Returns:
Sequence – The sequence parsed from the buffer, either as a DigitalSequence if an alphabet was provided, or as a TextSequence if None was given.
- Raises:
ValueError – When format is not a valid sequence format.
OSError – If an internal parser error occurred while guessing the alphabet or the format.
- classmethod parseinto(seq, buffer, format)
Parse a sequence from a binary buffer into seq.
- Parameters:
seq (Sequence) – The sequence object into which the deserialized sequence data will be written.
buffer (bytes or byte-like buffer) – A buffer containing the sequence data to parse. Any type implementing the buffer protocol (such as bytes, bytearray, or memoryview) is supported.
format (str) – The format of the sequence data. See the SequenceFile.__init__ documentation for allowed values.
- Raises:
ValueError – When format is not a valid sequence format.
OSError – If an internal parser error occurred while guessing the alphabet or the format.
- Returns:
Sequence – The sequence given as argument, or None if the end of the file was reached.
- read(skip_info=False, skip_sequence=False)
Read the next sequence from the file.
- Parameters:
skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.
skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.
- Returns:
Sequence – The next sequence in the file, or None if all sequences were read from the file.
- Raises:
ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Hint
This method allocates a new sequence on every call, which is inefficient when sequences are read within a tight loop. Where possible, use SequenceFile.readinto with an already initialized Sequence to recycle the internal buffers.
- read_block(sequences=None, residues=None)
Read several sequences into a sequence block.
- Parameters:
sequences (int, optional) – The maximum number of sequences to read before returning a block. Leave as None to read all remaining sequences from the file.
residues (int, optional) – The number of residues to read before returning the block. Leave as None to keep reading sequences without a residue limit.
- Returns:
SequenceBlock – A sequence block object, which may be empty if there are no sequences to read anymore. The concrete type depends on whether the SequenceFile was opened in text or digital mode.
- Raises:
ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Example
Read a block of at most 4 sequences from a sequence file:
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     block = sf.read_block(sequences=4)
>>> len(block)
4
Read sequences until the block contains at least 1000 residues:
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     block = sf.read_block(residues=1000)
>>> len(block)
3
>>> len(block[0]) + len(block[1]) + len(block[2])
1444
Note that the last sequence will not be truncated, so the block will always contain at least as many residues as the residues argument, unless the end of the file was reached.
Added in version 0.7.0.
- readinto(seq, skip_info=False, skip_sequence=False)
Read the next sequence from the file, using seq to store data.
- Parameters:
seq (Sequence) – A sequence object to use to store the next entry in the file. If this sequence was used before, it must be properly reset (using the Sequence.clear method) before using it again with readinto.
skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.
skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.
- Returns:
Sequence – A reference to seq that was passed as an input, or None if no sequences are left in the file.
- Raises:
ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Example
Use SequenceFile.readinto to loop over the sequences in a file while recycling the same Sequence buffer:
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     seq = TextSequence()
...     while sf.readinto(seq) is not None:
...         # ... process seq here ... #
...         seq.clear()
- rewind()
Rewind the file back to the beginning.
For sequential formats, this method is supported for both path-based and file object-based sequence files. For multiple-sequence alignment formats, the underlying MSAFile needs to be reopened, so this is only supported for path-based files.
- Raises:
io.UnsupportedOperation – When attempting to rewind a sequence file where the underlying stream is a file-like object that does not support the seek method.
- closed
Whether the SequenceFile is closed or not.
- Type: bool
- digital
Whether the SequenceFile is in digital mode or not.
Added in version 0.5.0.
- Type: bool
- format
The format of the SequenceFile.
- Type: str
- class pyhmmer.easel.MSAFile
A wrapper around a multiple-alignment file.
This class supports reading sequences stored in different formats, such as Stockholm, A2M, PSI-BLAST or Clustal.
- name
The name of the MSA file, if it was created from a filename, or None if it wraps a file-like object.
- Type: str, optional
Hint
Some Clustal files created by alignment tools other than Clustal (such as MUSCLE or MAFFT, for instance), may not contain the header expected by Easel for the Clustal format. If you get an error while trying to parse these files, use the "clustallike" format instead of the "clustal" format when creating the MSAFile.
- __init__(file, format=None, *, digital=False, alphabet=None)
Create a new MSA file parser wrapping the given file.
- Parameters:
file (str or file-like object) – Either the path to a file containing the sequences to read, or a file-like object opened in binary mode.
format (str, optional) – The format of the file, or None to autodetect. Supported values are: stockholm, pfam, a2m, psiblast, selex, afa (aligned FASTA), clustal, clustallike, phylip, phylips.
- Keyword Arguments:
digital (bool) – Whether to read the sequences in text or digital mode. This will affect the type of MSA objects returned later by the read function.
alphabet (Alphabet) – The alphabet to use to digitize the sequences while reading. If None is given, it will be guessed based on the contents of the first sequence.
- Raises:
ValueError – When format is not a valid MSA format.
Changed in version 0.4.8: Support reading from a file-like object.
Changed in version 0.5.0: Added the digital and alphabet keyword arguments.
- close()
Close the file and free the resources used by the parser.
- guess_alphabet()
Guess the alphabet of an open MSAFile.
This method tries to guess the alphabet of a multiple-alignment file by inspecting the first entry in the file. It returns the alphabet, or None if the file alphabet cannot be reliably guessed.
- Raises:
EOFError – If the file is empty.
OSError – If a parse error occurred.
ValueError – If this method is called after the file was closed.
Example
>>> with MSAFile("tests/data/msa/laccase.clw") as mf:
...     mf.guess_alphabet()
Alphabet.amino()
- read()
Read the next alignment from the file.
- Returns:
MSA – The next alignment in the file, or None if all the alignments were read from the file already.
- Raises:
ValueError – When attempting to read an alignment from a closed file, or when the file could not be parsed.