MSAFile#
- class pyhmmer.easel.SequenceFile#
A wrapper around a sequence file, containing unaligned sequences.
This class supports reading sequences stored in different formats, such as FASTA, GenBank or EMBL. The format of each file can be automatically detected, but it is also possible to pass an explicit format specifier when the SequenceFile is instantiated.
Hint
SequenceFile objects can also be used to parse files containing multiple sequence alignments: in that case, the sequences will be read sequentially, removing the gap characters:
>>> with SequenceFile("tests/data/msa/LuxC.sto") as sf:
...     sequences = sf.read_block()
>>> print(sequences[0].name[:6], sequences[0].sequence[:30])
Q9KV99 LANQPLEAILGLINEARKSWSSTPELDPYR
>>> print(sequences[1].name[:6], sequences[1].sequence[:30])
Q2WLE3 IYSYPSEAMIEIINEYSKILCSDRKFLSYE
Added in version 0.2.0: The alphabet attribute.
Changed in version 0.4.8: Support reading sequences from a file-like handle. Support reading individual sequences from an MSA file.
- __init__(file, format=None, *, digital=False, alphabet=None)#
Create a new sequence file parser wrapping the given file.
- Parameters:
  - file (str or file-like object) – Either the path to a file containing the sequences to read, or a file-like object opened in binary mode.
  - format (str, optional) – The format of the file, or None to autodetect. Supported values are: fasta, embl, genbank, ddbj, uniprot, ncbi, daemon, hmmpgmd, fmindex, plus any format also supported by MSAFile.
  - digital (bool) – Whether to read the sequences in text or digital mode. This will affect the type of Sequence objects returned later by the read method.
  - alphabet (Alphabet) – The alphabet to use to digitize the sequences while reading. If None is given, it will be guessed based on the contents of the first sequence.
- Raises:
  - ValueError – When format is not a valid sequence format.
  - OSError – If an internal parser error occurred while guessing the alphabet or the format.
Caution
SequenceFile can generally read sequences from binary-mode file-like objects, except for sequences in an NCBI BLAST database, since such a database is composed of multiple files. Reading from an NCBI BLAST database given as a filename is supported, however.
Changed in version 0.4.4: Added the ignore_gaps parameter.
Changed in version 0.4.8: Support reading from a file-like object (except NCBI format).
Changed in version 0.5.0: Added the digital and alphabet keyword arguments.
Deprecated since version 0.6.0: The ignore_gaps keyword argument, use afa format instead.
Changed in version 0.8.0: Removed the ignore_gaps keyword argument.
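As a minimal sketch of the file-like object support described above, the snippet below wraps an in-memory FASTA payload in io.BytesIO and passes an explicit format. The record names and residues are made up for illustration, and the import guard is only an assumption about the reader's environment: it lets the snippet run (and do nothing) where pyhmmer is not installed.

```python
import io

try:
    from pyhmmer.easel import SequenceFile
except ImportError:  # pyhmmer is not installed; the demonstration is skipped
    SequenceFile = None

# Hypothetical two-record FASTA payload, held in memory as bytes.
FASTA = b">seq1\nATGCATGCATGC\n>seq2\nTTGACATTGACA\n"

count = None
if SequenceFile is not None:
    # The handle must be in binary mode; io.BytesIO satisfies that.
    with SequenceFile(io.BytesIO(FASTA), format="fasta") as sf:
        block = sf.read_block()  # no limits given: read all remaining sequences
        count = len(block)
```

Passing the format explicitly avoids relying on autodetection for such a short payload; whether autodetection would also succeed here is not something this sketch checks.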
- close()#
Close the file and free the resources used by the parser.
- guess_alphabet()#
Guess the alphabet of an open SequenceFile.
This method tries to guess the alphabet of a sequence file by inspecting the first sequence in the file. It returns the alphabet, or None if the file alphabet cannot be reliably guessed.
- Raises:
  - EOFError – If the file is empty.
  - OSError – If a parse error occurred.
  - ValueError – If this method is called on a closed file.
Example
>>> with SequenceFile("tests/data/seqs/bmyD.fna") as sf:
...     sf.guess_alphabet()
DNA()
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     sf.guess_alphabet()
AA()
Added in version 0.6.3.
- classmethod parse(buffer, format, *, alphabet=None)#
Parse a sequence from a binary buffer using the given format.
- Arguments:
  - buffer (str, bytes or byte-like buffer): A buffer containing the sequence data to parse. In addition to str, any type implementing the buffer protocol (such as bytes, bytearray, or memoryview) is supported and interpreted as containing an ASCII string.
  - format (str): The format of the sequence data. See the SequenceFile.__init__ documentation for allowed values.
- Keyword Arguments:
  - alphabet (Alphabet) – The alphabet to use to digitize the returned sequence, if desired.
- Returns:
  Sequence – The sequence parsed from the buffer, either as a DigitalSequence if an alphabet was provided, or as a TextSequence if None was given.
- Raises:
  - ValueError – When format is not a valid sequence format.
  - OSError – If an internal parser error occurred while guessing the alphabet or the format.
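A short, hedged illustration of the parse classmethod: a single FASTA record held in a bytes object is parsed without an alphabet, so a text-mode sequence is expected back. The record content is invented, and the import guard is an assumption that only serves to keep the snippet harmless where pyhmmer is unavailable.

```python
try:
    from pyhmmer.easel import SequenceFile
except ImportError:  # pyhmmer is not installed; the demonstration is skipped
    SequenceFile = None

# Hypothetical single-record FASTA buffer.
RECORD = b">seq1 example record\nATGCATGCATGC\n"

residues = None
if SequenceFile is not None:
    # No alphabet is passed, so the result is a text sequence.
    seq = SequenceFile.parse(RECORD, "fasta")
    residues = seq.sequence
```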
- classmethod parseinto(seq, buffer, format)#
Parse a sequence from a binary buffer into seq.
- Arguments:
  - seq (Sequence): The sequence object into which the deserialized sequence data will be written.
  - buffer (str, bytes or byte-like buffer): A buffer containing the sequence data to parse. In addition to str, any type implementing the buffer protocol (such as bytes, bytearray, or memoryview) is supported and interpreted as containing an ASCII string.
  - format (str): The format of the sequence data. See the SequenceFile.__init__ documentation for allowed values.
- Raises:
  - ValueError – When format is not a valid sequence format.
  - OSError – If an internal parser error occurred while guessing the alphabet or the format.
- Returns:
  Sequence – The sequence given as argument, or None if the end of the file was reached.
- read(skip_info=False, skip_sequence=False)#
Read the next sequence from the file.
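- Parameters:
  - skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.
  - skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.
- Returns:
  Sequence – The next sequence in the file, or None if all sequences were read from the file.
- Raises:
  - ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Hint
This method allocates a new sequence on every call, which is inefficient when sequences are read within a tight loop. Where possible, use SequenceFile.readinto with an already initialized Sequence to recycle the internal buffers.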
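Because read signals the end of the file by returning None, a small generator can turn it into an ordinary loop. The sketch below is duck-typed: it only assumes an object with the read method documented above, and the helper names iter_records and list_names are hypothetical.

```python
def iter_records(sf, **read_options):
    """Yield records from `sf` until its `read` method returns None."""
    while True:
        seq = sf.read(**read_options)
        if seq is None:
            return
        yield seq


def list_names(sf):
    """Collect record names only, asking the parser to skip the residues."""
    return [seq.name for seq in iter_records(sf, skip_sequence=True)]
```

With an open SequenceFile named sf, list_names(sf) would walk the remaining records while skipping the sequence letters.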
- read_block(sequences=None, residues=None)#
Read several sequences into a sequence block.
- Parameters:
  - sequences (int, optional) – The maximum number of sequences to read before returning a block. Leave as None to read all remaining sequences from the file.
  - residues (int, optional) – The number of residues to read before returning the block. Leave as None to keep reading sequences without a residue limit.
- Returns:
  SequenceBlock – A sequence block object, which may be empty if there are no sequences left to read. The concrete type depends on whether the SequenceFile was opened in text or digital mode.
- Raises:
  - ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Example
Read a block of at most 4 sequences from a sequence file:
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     block = sf.read_block(sequences=4)
>>> len(block)
4
Read sequences until the block contains at least 1000 residues:
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     block = sf.read_block(residues=1000)
>>> len(block)
3
>>> len(block[0]) + len(block[1]) + len(block[2])
1444
Note that the last sequence will not be truncated, so the block will always contain at least as many residues as the residues argument requests, unless the end of the file was reached.
Added in version 0.7.0.
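The two limits can be pictured with a pure-Python model. This is not pyhmmer's implementation, only a sketch of the behaviour described above applied to plain strings; the function name take_block is hypothetical, and stopping at whichever limit is reached first when both are given is an assumption.

```python
def take_block(records, sequences=None, residues=None):
    """Model the documented limits on an iterable of plain strings."""
    block, total = [], 0
    it = iter(records)
    # Keep pulling records while neither limit has been reached.
    while (sequences is None or len(block) < sequences) and (
        residues is None or total < residues
    ):
        record = next(it, None)
        if record is None:  # end of input reached before any limit
            break
        block.append(record)  # records are kept whole, never truncated
        total += len(record)
    return block


# Hypothetical record lengths: 400, 450, 300 and 200 residues.
records = ["A" * 400, "C" * 450, "G" * 300, "T" * 200]
block = take_block(records, residues=1000)
```

Here the third record pushes the running total past the requested 1000 residues, so the block ends up holding three whole records rather than a truncated one.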
- readinto(seq, skip_info=False, skip_sequence=False)#
Read the next sequence from the file, using seq to store data.
- Parameters:
  - seq (Sequence) – A sequence object to use to store the next entry in the file. If this sequence was used before, it must be properly reset (using the Sequence.clear method) before using it again with readinto.
  - skip_info (bool) – Pass True to disable reading the sequence metadata, and only read the sequence letters. Defaults to False.
  - skip_sequence (bool) – Pass True to disable reading the sequence letters, and only read the sequence metadata. Defaults to False.
- Returns:
  Sequence – A reference to seq that was passed as an input, or None if no sequences are left in the file.
- Raises:
  - ValueError – When attempting to read a sequence from a closed file, or when the file could not be parsed.
Example
Use SequenceFile.readinto to loop over the sequences in a file while recycling the same Sequence buffer:
>>> with SequenceFile("tests/data/seqs/LuxC.faa") as sf:
...     seq = TextSequence()
...     while sf.readinto(seq) is not None:
...         # ... process seq here ... #
...         seq.clear()
- rewind()#
Rewind the file back to the beginning.
For sequential formats, this method is supported for both path-based and file object-based sequence files. For multiple-sequence alignment formats, the underlying MSAFile needs to be reopened, so this is only supported for path-based files.
- Raises:
  - io.UnsupportedOperation – When attempting to rewind a sequence file where the underlying stream is a file-like object that does not support the seek method.
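A two-pass read is a typical reason to call rewind. The sketch below counts the records first, then rewinds and collects their names, falling back gracefully when the stream cannot seek. It is duck-typed against the read and rewind methods documented here; the helper name names_after_count is hypothetical.

```python
import io


def names_after_count(sf):
    """Count the records in `sf`, then rewind and collect their names.

    Returns (count, names); names is None when the file cannot be rewound.
    """
    count = 0
    while sf.read(skip_sequence=True) is not None:
        count += 1
    try:
        sf.rewind()
    except io.UnsupportedOperation:
        # The underlying stream has no usable seek: no second pass.
        return count, None
    names = []
    while (seq := sf.read(skip_sequence=True)) is not None:
        names.append(seq.name)
    return count, names
```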
- closed#
Whether the SequenceFile is closed or not.
- Type:
  bool
- digital#
Whether the SequenceFile is in digital mode or not.
Added in version 0.5.0.
- Type:
  bool
- format#
The format of the SequenceFile.
- Type:
  str
- indexed#
The indexed sequences.
This property can be used to access the sequences in a file by name, provided they are indexed in an SSI file (exposed as an SSIReader in the SequenceFile.index attribute).
In the case where the sequence file has no associated index (because it was opened from a file-like object, or because the associated index could not be opened successfully), this attribute is None.
Example
>>> file = SequenceFile("tests/data/seqs/938293.PRJEB85.HG003687.faa")
>>> file.index
<pyhmmer.easel.SSIReader object at 0x...>
>>> seq = file.indexed['938293.PRJEB85.HG003684_29']
>>> print(seq.name, seq.sequence[:30])
938293.PRJEB85.HG003684_29 MESINKKIEDIMTKNTGKIFSINDFYGLGT
>>> file.close()
Added in version 0.12.0.
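Since indexed is None when no SSI index is available, code that looks records up by name may want a fallback. The sketch below uses the index when present and otherwise scans the file from the start. It is duck-typed against the indexed, rewind and read members documented here; the helper name fetch is hypothetical, and what the indexed lookup does for a missing name is not covered by this page, so that case is left to the library.

```python
def fetch(sf, name):
    """Return the record called `name`, using the SSI index when available."""
    if sf.indexed is not None:
        # Fast path: random access by name through the index.
        return sf.indexed[name]
    # Slow path: no index, so scan sequentially from the beginning.
    sf.rewind()
    while True:
        seq = sf.read()
        if seq is None or seq.name == name:
            return seq  # None here means the name was not found
```

The slow path inherits the limits of rewind, so it would not work on a stream that cannot seek.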
- class pyhmmer.easel.MSAFile#
A wrapper around a multiple-alignment file.
This class supports reading sequences stored in different formats, such as Stockholm, A2M, PSI-BLAST or Clustal.
- name#
The name of the MSA file, if it was created from a filename, or None if it wraps a file-like object.
- Type:
  str, optional
Hint
Some Clustal files created by alignment tools other than Clustal (such as MUSCLE or MAFFT) may not contain the header expected by Easel for the Clustal format. If you get an error while trying to parse these files, use the "clustallike" format instead of the "clustal" format when creating the MSAFile.
- __init__(file, format=None, *, digital=False, alphabet=None)#
Create a new MSA file parser wrapping the given file.
- Parameters:
  - file (str or file-like object) – Either the path to a file containing the sequences to read, or a file-like object opened in binary mode.
  - format (str, optional) – The format of the file, or None to autodetect. Supported values are: stockholm, pfam, a2m, psiblast, selex, afa (aligned FASTA), clustal, clustallike, phylip, phylips.
- Keyword Arguments:
  - digital (bool) – Whether to read the sequences in text or digital mode. This will affect the type of MSA objects returned later by the read method.
  - alphabet (Alphabet) – The alphabet to use to digitize the sequences while reading. If None is given, it will be guessed based on the contents of the first sequence.
- Raises:
  - ValueError – When format is not a valid MSA format.
Changed in version 0.4.8: Support reading from a file-like object.
Changed in version 0.5.0: Added the digital and alphabet keyword arguments.
- close()#
Close the file and free the resources used by the parser.
- guess_alphabet()#
Guess the alphabet of an open MSAFile.
This method tries to guess the alphabet of a multiple-alignment file by inspecting the first entry in the file. It returns the alphabet, or None if the file alphabet cannot be reliably guessed.
- Raises:
  - EOFError – If the file is empty.
  - OSError – If a parse error occurred.
  - ValueError – If this method is called after the file was closed.
Example
>>> with MSAFile("tests/data/msa/LuxC.sto") as mf:
...     mf.guess_alphabet()
AA()
- classmethod parse(buffer, format=None, *, alphabet=None)#
Parse an MSA from a binary buffer using the given format.
- Arguments:
  - buffer (str, bytes or byte-like buffer): A buffer containing the sequence data to parse. In addition to str, any type implementing the buffer protocol (such as bytes, bytearray, or memoryview) is supported and interpreted as containing an ASCII string.
  - format (str, optional): The format of the sequence data. See the MSAFile.__init__ documentation for allowed values. If None is given, the format will be autodetected.
- Keyword Arguments:
  - alphabet (Alphabet) – The alphabet to use to digitize the returned sequence, if desired.
- Returns:
  MSA – The MSA parsed from the buffer, either as a DigitalMSA if an alphabet was provided, or as a TextMSA if None was given.
- Raises:
  - ValueError – When format is not a valid sequence format.
  - OSError – If an internal parser error occurred while guessing the alphabet or the format.
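As a hedged illustration of MSAFile.parse, the snippet below parses a tiny Stockholm alignment from a bytes object without an alphabet and records the type of the result, which the description above says should be a TextMSA. The alignment itself is invented, and the import guard is an assumption that simply skips the demonstration where pyhmmer is not installed.

```python
try:
    from pyhmmer.easel import MSAFile
except ImportError:  # pyhmmer is not installed; the demonstration is skipped
    MSAFile = None

# Hypothetical two-row alignment in Stockholm format.
STOCKHOLM = b"# STOCKHOLM 1.0\nseq1 ACGT-ACGT\nseq2 ACGTTACGT\n//\n"

kind = None
if MSAFile is not None:
    # No alphabet is passed, so a text-mode alignment is expected.
    msa = MSAFile.parse(STOCKHOLM, "stockholm")
    kind = type(msa).__name__
```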
- read()#
Read the next alignment from the file.
- Returns:
  MSA – The next alignment in the file, or None if all the alignments were read from the file already.
- Raises:
  - ValueError – When attempting to read an alignment from a closed file, or when the file could not be parsed.
- indexed#
The indexed alignments.
This property can be used to access the alignments in a file by name, provided they are indexed in an SSI file (exposed as an SSIReader in the MSAFile.index attribute).
In the case where the alignment file has no associated index (because it was opened from a file-like object, or because the associated index could not be opened successfully), this attribute is None.
Added in version 0.12.0.