Alignments#
- class pyhmmer.easel.MSA#
An abstract alignment of multiple sequences.
Hint
Use
len(msa)to get the number of columns in the alignment, andlen(msa.sequences)to get the number of sequences (i.e. the number of rows).- checksum()#
Calculate a 32-bit checksum for the multiple sequence alignment.
- compute_weights(method='pb', max_identity=0.62)#
Compute sequence weights from the alignment data.
This method provides the same functionality as the
esl-weightmini-app. Three different algorithms are supported in Easel:- Position-based
The position-based weighting algorithm described in Henikoff & Henikoff (1994). It is defined for ungapped alignments, It does not give rules for dealing with gaps, nor for degenerate residue symbols. The rule here is to ignore these, and moreover (in digital mode alignments, in order to deal with deep gappy alignments) to only consider alignment columns that are “consensus” (defined below). This means that longer sequences initially get more weight; hence we do a “double normalization” in which the weights are first divided by sequence length in canonical residues (to get the average weight per residue, to compensate for that effect), then normalized to sum to \(N\).
The algorithm requires \(O(NL)\) time and \(O(LK)\) memory, for an alignment of \(L\) columns, \(N\) sequences, and alphabet size \(K\).
- Gerstein/Sonnhammer/Chothia
An implementation of Gerstein et al. (1994).
The algorithm if \(O(N^2)\) memory (it requires a pairwise distance matrix) and \(O(N^3 + LN^2)\) time (\(N^3\) for a UPGMA tree building step, \(LN^2\) for distance matrix construction) for an alignment of \(N\) sequences and \(L\) columns. In the Easel implementation, the actual memory requirement is dominated by two full \(N \times N\) distances matrices. To keep the calculation under memory limits, avoid processing large alignments with this algorithm.
- BLOSUM
The BLOSUM algorithm described in Henikoff & Henikoff (1992). It performs a single-linkage clustering by fractional id, defines clusters such that no two clusters have a pairwise link \(\geq\)
max_identity, and assigns weights of \(\frac{1}{M_i}\) to each of the \(M_i\) sequences in each cluster $i$. Themax_identityis a fractional pairwise identity in the range \(0..1\).
- Parameters:
- Raises:
ValueError – When a pairwise identity calculation fails because of corrupted sequence data.
- Returns:
VectorD– The newly computed weights. Note that this is a reference to thesequence_weightsproperty, and the data may be mutated by subsequent calls tocompute_weights. UseVectorD.copyto keep a hard copy if needed, e.g. for comparing between methods.
Note
The MSA may be in either digital or text mode. Digital mode is preferred so that the pairwise identity calculations deal with degenerate residue symbols properly. See also
DigitalMSA.compute_weightswhich provides additional options controling how to determine the consensus columns inpbmode.References
Henikoff, Steven, and Joria G. Henikoff. “Amino acid substitution matrices from protein blocks.” Proceedings of the National Academy of Sciences of the United States of America vol. 89,22 (1992): 10915-9. doi:10.1073/pnas.89.22.10915 PMID:1438297.
Henikoff, SSteven, and Joria G. Henikoff. “Position-based sequence weights.” Journal of molecular biology vol. 243,4 (1994): 574-8. doi:10.1016/0022-2836(94)90032-9.
Gerstein, Mark, Erik L.L. Sonnhammer and Cyrus Chothia. “Volume changes in protein evolution.” Journal of molecular biology vol. 236,4 (1994): 1067-78. doi:10.1016/0022-2836(94)90012-4. PMID:8120887.
Added in version 0.11.3.
- mark_fragments(threshold)#
Mark fragmented sequences in the alignment.
This method calculates the fractional “span” of each aligned sequence (the aligned length from its first to last residue) divided by the total number aligned columns.
- Parameters:
threshold (
float) – The span threshold for the fragmented sequences. Sequences with a span under this will be marked as fragments.- Returns:
Bitfield– A bitfield object of sizelen(self.sequences)indicating which sequences of the alignment are fragments.
Note
In a
TextMSA, any alphanumeric character is considered to be a residue, and any non-alphanumeric char is considered to be a gap.Example
>>> s1 = TextSequence(name="seq1", sequence="--ATGC---") >>> s2 = TextSequence(name="seq2", sequence="TTATCCG-T") >>> s3 = TextSequence(name="seq3", sequence="TT-TCCGAT") >>> msa = TextMSA(name="msa", sequences=[s1, s2, s3]) >>> msa.mark_fragments(0.5) Bitfield([True, False, False])
Added in version 0.11.3.
- select(sequences=None, columns=None)#
Select and copy a subset of the multiple sequence alignment.
- Parameters:
- Raises:
IndexError – When given indices that are out of bounds for the sequences or columns.
Example
>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATCC") >>> s3 = TextSequence(name="seq3", sequence="ATGA") >>> msa = TextMSA(name="msa", sequences=[s1, s2, s3]) >>> msa.select(sequences=[0, 2]).names ('seq1', 'seq3') >>> tuple(msa.select(columns=range(1,4)).alignment) ('TGC', 'TCC', 'TGA')
Added in version 0.11.1.
- write(fh, format)#
Write the multiple sequence alignement to a file handle.
- Parameters:
fh (
io.IOBase) – A Python file handle, opened in binary mode.format (
str) – The name of the multiple sequence alignment file format to use. Supported values are:stockholm,pfam(equivalent to thestockholmformat but with a single alignment line per sequence),a2m,psiblast,selex,afa(aligned FASTA),clustal,clustallike(equivalent toclustalwithout the strictCLUSTAL 2.1header line),phylip(interleaved),phylips(sequential).
Added in version 0.3.0.
- indexed#
A mapping of names to sequences.
This property can be used to access the sequence of a multiple sequence alignment by name. An index is created the first time this property is accessed.
- Raises:
KeyError – When attempting to create an index for an alignment containing duplicate sequence names.
Example
>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATTA") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> msa.indexed['seq1'].sequence 'ATGC' >>> msa.indexed['seq3'] Traceback (most recent call last): ... KeyError: 'seq3'
Added in version 0.11.1.
- Type:
- names#
The name of each sequence in the alignment.
Every sequence in the alignment is required to have a name, so no member of the
tuplewill ever beNone.Example
>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATGC") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> msa.names ('seq1', 'seq2')
Added in version 0.4.8.
- posterior_probabilities#
The consensus posterior probabilities, if any.
Added in version 0.11.1.
- secondary_structure#
The consensus secondary structure, if any.
Added in version 0.11.1.
- sequence_weights#
The alignment sequence weights.
Sequence weights must sum to
len(self.sequences). By default, every sequence receives a weight of 1.Caution
Easel keeps an internal flag to mark whether a
MSAhas non-default sequence weights defined. To ensure this flag is properly set in the case of user-defined sequence weights, use the property setter rather than modifying the weights inplace:>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATCC") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> msa.sequence_weights # default weights, flag unset VectorD([1.0, 1.0]) >>> msa.sequence_weights = VectorD([0.5, 1.5]) >>> msa.sequence_weights # non-default weights, flag set VectorD([0.5, 1.5])
And use the
delkeyword to reset the sequence weights to their default values and unset the flag:>>> del msa.sequence_weights >>> msa.sequence_weights # default weights, flag unset again VectorD([1.0, 1.0])
Added in version 0.11.3.
- Type:
- class pyhmmer.easel.TextMSA(MSA)#
A multiple sequence alignement stored in text mode.
- __init__(*, name=None, description=None, accession=None, sequences=None, author=None)#
Create a new text-mode alignment with the given
sequences.- Keyword Arguments:
name (
strorbytes, optional) – The name of the alignment, if any.description (
strorbytes, optional) – The description of the alignment, if any.accession (
strorbytes, optional) – The accession of the alignment, if any.sequences (collection of
TextSequence) – The sequences to store in the multiple sequence alignment. All sequences must have the same length. They also need to have distinct names.author (
strorbytes, optional) – The author of the alignment, often used to record the aligner it was created with.
- Raises:
ValueError – When the alignment cannot be created from the given sequences.
TypeError – When
sequencesis not an iterable ofTextSequenceobjects.
Example
>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATGC") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> len(msa) 4
Changed in version 0.3.0: Allow creating an alignment from an iterable of
TextSequence.Deprecated since version 0.11.1: Passing positional arguments to constructor.
- checksum()#
Calculate a 32-bit checksum for the multiple sequence alignment.
- compute_weights(method='pb', max_identity=0.62)#
Compute sequence weights from the alignment data.
This method provides the same functionality as the
esl-weightmini-app. Three different algorithms are supported in Easel:- Position-based
The position-based weighting algorithm described in Henikoff & Henikoff (1994). It is defined for ungapped alignments, It does not give rules for dealing with gaps, nor for degenerate residue symbols. The rule here is to ignore these, and moreover (in digital mode alignments, in order to deal with deep gappy alignments) to only consider alignment columns that are “consensus” (defined below). This means that longer sequences initially get more weight; hence we do a “double normalization” in which the weights are first divided by sequence length in canonical residues (to get the average weight per residue, to compensate for that effect), then normalized to sum to \(N\).
The algorithm requires \(O(NL)\) time and \(O(LK)\) memory, for an alignment of \(L\) columns, \(N\) sequences, and alphabet size \(K\).
- Gerstein/Sonnhammer/Chothia
An implementation of Gerstein et al. (1994).
The algorithm if \(O(N^2)\) memory (it requires a pairwise distance matrix) and \(O(N^3 + LN^2)\) time (\(N^3\) for a UPGMA tree building step, \(LN^2\) for distance matrix construction) for an alignment of \(N\) sequences and \(L\) columns. In the Easel implementation, the actual memory requirement is dominated by two full \(N \times N\) distances matrices. To keep the calculation under memory limits, avoid processing large alignments with this algorithm.
- BLOSUM
The BLOSUM algorithm described in Henikoff & Henikoff (1992). It performs a single-linkage clustering by fractional id, defines clusters such that no two clusters have a pairwise link \(\geq\)
max_identity, and assigns weights of \(\frac{1}{M_i}\) to each of the \(M_i\) sequences in each cluster $i$. Themax_identityis a fractional pairwise identity in the range \(0..1\).
- Parameters:
- Raises:
ValueError – When a pairwise identity calculation fails because of corrupted sequence data.
- Returns:
VectorD– The newly computed weights. Note that this is a reference to thesequence_weightsproperty, and the data may be mutated by subsequent calls tocompute_weights. UseVectorD.copyto keep a hard copy if needed, e.g. for comparing between methods.
Note
The MSA may be in either digital or text mode. Digital mode is preferred so that the pairwise identity calculations deal with degenerate residue symbols properly. See also
DigitalMSA.compute_weightswhich provides additional options controling how to determine the consensus columns inpbmode.References
Henikoff, Steven, and Joria G. Henikoff. “Amino acid substitution matrices from protein blocks.” Proceedings of the National Academy of Sciences of the United States of America vol. 89,22 (1992): 10915-9. doi:10.1073/pnas.89.22.10915 PMID:1438297.
Henikoff, SSteven, and Joria G. Henikoff. “Position-based sequence weights.” Journal of molecular biology vol. 243,4 (1994): 574-8. doi:10.1016/0022-2836(94)90032-9.
Gerstein, Mark, Erik L.L. Sonnhammer and Cyrus Chothia. “Volume changes in protein evolution.” Journal of molecular biology vol. 236,4 (1994): 1067-78. doi:10.1016/0022-2836(94)90012-4. PMID:8120887.
Added in version 0.11.3.
- copy()#
Duplicate the text sequence alignment, and return the copy.
- digitize(alphabet)#
Convert the text alignment to a digital alignment using
alphabet.- Returns:
DigitalMSA– An alignment in digital mode containing the same sequences digitized withalphabet.- Raises:
ValueError – When the text sequence contains invalid characters that cannot be converted according to
alphabet.symbols.
- mark_fragments(threshold)#
Mark fragmented sequences in the alignment.
This method calculates the fractional “span” of each aligned sequence (the aligned length from its first to last residue) divided by the total number aligned columns.
- Parameters:
threshold (
float) – The span threshold for the fragmented sequences. Sequences with a span under this will be marked as fragments.- Returns:
Bitfield– A bitfield object of sizelen(self.sequences)indicating which sequences of the alignment are fragments.
Note
In a
TextMSA, any alphanumeric character is considered to be a residue, and any non-alphanumeric char is considered to be a gap.Example
>>> s1 = TextSequence(name="seq1", sequence="--ATGC---") >>> s2 = TextSequence(name="seq2", sequence="TTATCCG-T") >>> s3 = TextSequence(name="seq3", sequence="TT-TCCGAT") >>> msa = TextMSA(name="msa", sequences=[s1, s2, s3]) >>> msa.mark_fragments(0.5) Bitfield([True, False, False])
Added in version 0.11.3.
- select(sequences=None, columns=None)#
Select and copy a subset of the multiple sequence alignment.
- Parameters:
- Raises:
IndexError – When given indices that are out of bounds for the sequences or columns.
Example
>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATCC") >>> s3 = TextSequence(name="seq3", sequence="ATGA") >>> msa = TextMSA(name="msa", sequences=[s1, s2, s3]) >>> msa.select(sequences=[0, 2]).names ('seq1', 'seq3') >>> tuple(msa.select(columns=range(1,4)).alignment) ('TGC', 'TCC', 'TGA')
Added in version 0.11.1.
- write(fh, format)#
Write the multiple sequence alignement to a file handle.
- Parameters:
fh (
io.IOBase) – A Python file handle, opened in binary mode.format (
str) – The name of the multiple sequence alignment file format to use. Supported values are:stockholm,pfam(equivalent to thestockholmformat but with a single alignment line per sequence),a2m,psiblast,selex,afa(aligned FASTA),clustal,clustallike(equivalent toclustalwithout the strictCLUSTAL 2.1header line),phylip(interleaved),phylips(sequential).
Added in version 0.3.0.
- alignment#
A view of the aligned sequence data.
This property gives access to the aligned sequences, including gap characters, so that they can be displayed or processed column by column.
Examples
Use
TextMSA.alignmentto display an alignment in text format:>>> for name, aligned in zip(luxc.names, luxc.alignment): ... print(name, " ", aligned[:40], "...") Q9KV99.1 LANQPLEAILGLINEARKSWSST------------PELDP ... Q2WLE3.1 IYSYPSEAMIEIINEYSKILCSD------------RKFLS ... Q97GS8.1 VHDIKTEETIDLLDRCAKLWLDDNYSKK--HIETLAQITN ... Q3WCI9.1 LLNVPLKEIIDFLVETGERIRDPRNTFMQDCIDRMAGTHV ... P08639.1 LNDLNINNIINFLYTTGQRWKSEEYSRRRAYIRSLITYLG ... ...
Use the splat operator (*) in combination with the
zipbuiltin to iterate over the columns of an alignment:>>> for idx, col in enumerate(zip(*luxc.alignment)): ... print(idx+1, col) 1 ('L', 'I', 'V', 'L', 'L', ...) 2 ('A', 'Y', 'H', 'L', 'N', ...) ...
Added in version 0.4.8.
Changed in version 0.11.1: Change the return type to a lazy
collections.abc.Sequence.- Type:
- indexed#
A mapping of names to sequences.
This property can be used to access the sequence of a multiple sequence alignment by name. An index is created the first time this property is accessed.
- Raises:
KeyError – When attempting to create an index for an alignment containing duplicate sequence names.
Example
>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATTA") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> msa.indexed['seq1'].sequence 'ATGC' >>> msa.indexed['seq3'] Traceback (most recent call last): ... KeyError: 'seq3'
Added in version 0.11.1.
- Type:
- names#
The name of each sequence in the alignment.
Every sequence in the alignment is required to have a name, so no member of the
tuplewill ever beNone.Example
>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATGC") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> msa.names ('seq1', 'seq2')
Added in version 0.4.8.
- posterior_probabilities#
The consensus posterior probabilities, if any.
Added in version 0.11.1.
- secondary_structure#
The consensus secondary structure, if any.
Added in version 0.11.1.
- sequence_weights#
The alignment sequence weights.
Sequence weights must sum to
len(self.sequences). By default, every sequence receives a weight of 1.Caution
Easel keeps an internal flag to mark whether a
MSAhas non-default sequence weights defined. To ensure this flag is properly set in the case of user-defined sequence weights, use the property setter rather than modifying the weights inplace:>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATCC") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> msa.sequence_weights # default weights, flag unset VectorD([1.0, 1.0]) >>> msa.sequence_weights = VectorD([0.5, 1.5]) >>> msa.sequence_weights # non-default weights, flag set VectorD([0.5, 1.5])
And use the
delkeyword to reset the sequence weights to their default values and unset the flag:>>> del msa.sequence_weights >>> msa.sequence_weights # default weights, flag unset again VectorD([1.0, 1.0])
Added in version 0.11.3.
- Type:
- sequences#
A view of the alignment sequences.
This property lets you access the individual sequences in the multiple sequence alignment as
TextSequenceinstances.Example
Query the number of sequences in the alignment with
len, or access individual members via indexing notation:>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATGC") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> len(msa.sequences) 2 >>> msa.sequences[0].name 'seq1'
Caution
Sequences in the list are copies, so editing their attributes will have no effect on the alignment:
>>> msa.sequences[0].name 'seq1' >>> msa.sequences[0].name = "seq1bis" >>> msa.sequences[0].name 'seq1'
Support for this feature may be added in a future version, but can be circumvented for now by forcingly setting the updated version of the object:
>>> seq = msa.sequences[0] >>> seq.name = "seq1bis" >>> msa.sequences[0] = seq >>> msa.sequences[0].name 'seq1bis'
Added in version 0.3.0.
- Type:
- class pyhmmer.easel.DigitalMSA(MSA)#
A multiple sequence alignment stored in digital mode.
- sample(alphabet, max_sequences, max_length, randomness=None)#
Sample a sequence of length at most
Lat random.- Parameters:
alphabet (
Alphabet) – The alphabet of the multiple sequence alignment.max_sequences (
int) – The maximum number of sequences to sample for the alignment (the actual sequence number is sampled).max_length (
int) – The maximum length of the alignment to generate (the actual sequence length is sampled).randomness (
Randomness,intorNone) – The random number generator to use for sampling, or a seed to initialize a generator. IfNoneor0given, create a new random number generator with a random seed.
- Returns:
DigitalMSA– A new digital multiple sequence alignment generated at random.
Hint
This constructor is only useful for testing and should not be used to generate random sequences to e.g. compute a background distribution for a statistical method, since this function samples alphabet residues at random irrespective of prior frequences.
Added in version 0.11.1.
- __init__(alphabet, *, name=None, description=None, accession=None, sequences=None, author=None)#
Create a new digital-mode alignment with the given
sequences.- Parameters:
alphabet (
Alphabet) – The alphabet of the alignmed sequences.- Keyword Arguments:
name (
bytes, optional) – The name of the alignment, if any.description (
bytes, optional) – The description of the alignment, if any.accession (
bytes, optional) – The accession of the alignment, if any.sequences (iterable of
DigitalSequence) – The sequences to store in the multiple sequence alignment. All sequences must have the same length and alphabet. They also need to have distinct names set.author (
bytes, optional) – The author of the alignment, often used to record the aligner it was created with.
Changed in version 0.3.0: Allow creating an alignment from an iterable of
DigitalSequence.Deprecated since version 0.11.1: Passing positional arguments other than
alphabet.
- checksum()#
Calculate a 32-bit checksum for the multiple sequence alignment.
- compute_weights(method='pb', max_identity=0.62, fragment_threshold=0.5, consensus_fraction=0.5, ignore_rf=False, sample=True, sample_threshold=50000, sample_count=10000, max_fragments=5000, seed=42, preference='conscover')#
Compute sequence weights from the alignment data.
This method provides the same functionality as the
esl-weightmini-app. It supports additional configuration parameters for consensus creation in position-based weighting. SeeMSA.compute_weightsdocumentation for more details.- Parameters:
- Keyword Arguments:
fragment_threshold (
float) – The threshold for determining which sequences of the alignment are fragments. An sequence spanning columns \(i\) to \(j\) of an alignment of width \(W\) will be flagged as a fragment if \(\frac{j - i}{ W } < \text{fragment_threshold}\),consensus_fraction (
float) – The parameter for determining with columns of the alignment are consensus columns. A column containing \(n\) symbols and \(m\) gaps will be marked a consensus column if \(\frac{n}{n + m} \ge \text{consensus_fraction}\).ignore_rf (
bool) – Set toTrueto ignore the RF line of the alignment (if present) and to force building the consensus.sample (
bool) – Whether or not to enable consensus determination by subsampling for large alignments. Set toFalseto force using all sequences.sample_threshold (
int) – The minimum number of sequences the alignment must contain to use subsampling for consensus determination (whensampleisTrue).sample_count (
int) – The number of sequences to use when determining consensus by random subsampling.max_fragments (
int) – The maximum number of allowed fragments in the sample used for determining consensus. If the sample contains more thanmax_fragmentsfragments, the consensus determination is done with all sequences instead.seed (
int) – The seed to use for initializing the random number generator (used whenpreferenceisrandomor whensampleisTrue). If0orNoneis given, an arbitrary seed will be chosen using the system clock.preference (
str) – The strategy to use for selecting the representative sequence in case of duplicates. Supported strategies areconscover(the default), which prefers the sequence with an alignment span that covers more consensus columns;origorderto use the first sequence in the original alignment order; andrandomto select the sequence at random.
- Raises:
ValueError – When a pairwise identity calculation fails because of corrupted sequence data.
- Returns:
VectorD– The newly computed weights. Note that this is a reference to thesequence_weightsproperty, and the data may be mutated by subsequent calls tocompute_weights. UseVectorD.copyto keep a hard copy if needed, e.g. for comparing between methods.
Added in version 0.11.3.
- copy()#
Duplicate the digital sequence alignment, and return the copy.
- identity_filter(max_identity=0.8, fragment_threshold=0.5, consensus_fraction=0.5, ignore_rf=False, sample=True, sample_threshold=50000, sample_count=10000, max_fragments=5000, seed=42, preference='conscover')#
Filter the alignment sequences by percent identity.
This method provides the same functionality as the
esl-weightmini-app with the-fflag enabled.- Parameters:
max_identity (
float) – The maximum fractional identity allowed between two alignment sequences. Sequences with fractional identity \(\geq\) this number will be removed, with the remaining one selected according to the givenpreference.- Keyword Arguments:
fragment_threshold (
float) – The threshold for determining which sequences of the alignment are fragments. An sequence spanning columns \(i\) to \(j\) of an alignment of width \(W\) will be flagged as a fragment if \(\frac{j - i}{ W } < \text{fragment_threshold}\),consensus_fraction (
float) – The parameter for determining with columns of the alignment are consensus columns. A column containing \(n\) symbols and \(m\) gaps will be marked a consensus column if \(\frac{n}{n + m} \ge \text{consensus_fraction}\).ignore_rf (
bool) – Set toTrueto ignore the RF line of the alignment (if present) and to force building the consensus.sample (
bool) – Whether or not to enable consensus determination by subsampling for large alignments. Set toFalseto force using all sequences.sample_threshold (
int) – The minimum number of sequences the alignment must contain to use subsampling for consensus determination (whensampleisTrue).sample_count (
int) – The number of sequences to use when determining consensus by random subsampling.max_fragments (
int) – The maximum number of allowed fragments in the sample used for determining consensus. If the sample contains more thanmax_fragmentsfragments, the consensus determination is done with all sequences instead.seed (
int) – The seed to use for initializing the random number generator (used whenpreferenceisrandomor whensampleisTrue). If0orNoneis given, an arbitrary seed will be chosen using the system clock.preference (
str) – The strategy to use for selecting the representative sequence in case of duplicates. Supported strategies areconscover(the default), which prefers the sequence with an alignment span that covers more consensus columns;origorderto use the first sequence in the original alignment order; andrandomto select the sequence at random.
- Returns:
MSA– The multiple sequence alignments with duplicate sequence removed. Unparsed Sotckholm markup is not propagated.
- mark_fragments(threshold)#
Mark fragmented sequences in the alignment.
This method calculates the fractional “span” of each aligned sequence (the aligned length from its first to last residue) divided by the total number aligned columns.
- Parameters:
threshold (
float) – The span threshold for the fragmented sequences. Sequences with a span under this will be marked as fragments.- Returns:
Bitfield– A bitfield object of sizelen(self.sequences)indicating which sequences of the alignment are fragments.
Note
In a
TextMSA, any alphanumeric character is considered to be a residue, and any non-alphanumeric char is considered to be a gap.Example
>>> s1 = TextSequence(name="seq1", sequence="--ATGC---") >>> s2 = TextSequence(name="seq2", sequence="TTATCCG-T") >>> s3 = TextSequence(name="seq3", sequence="TT-TCCGAT") >>> msa = TextMSA(name="msa", sequences=[s1, s2, s3]) >>> msa.mark_fragments(0.5) Bitfield([True, False, False])
Added in version 0.11.3.
- reverse_complement(inplace=False)#
Build the reverse complement of the MSA.
In addition to reverse-complementing the sequence data, per-column and per-residue annotation also gets reversed or reverse complemented.
- Parameters:
inplace (
bool) – Whether or not to copy the sequence before computing its reverse complement. WithFalse(the default), the method will return a copy of the sequence that has been reverse-complemented. WithTrue, it will reverse-complement inplace and returnNone.- Raises:
ValueError – When the alphabet of the
DigitalMSAdoes not have a complement mapping set (e.g.,Alphabet.amino).
Caution
The copy made when
inplaceisFalseis an exact copy, so thename,descriptionandaccessionof the copy will be the same. This could lead to duplicates if you’re not careful!Added in version 0.11.3.
- select(sequences=None, columns=None)#
Select and copy a subset of the multiple sequence alignment.
- Parameters:
- Raises:
IndexError – When given indices that are out of bounds for the sequences or columns.
Example
>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATCC") >>> s3 = TextSequence(name="seq3", sequence="ATGA") >>> msa = TextMSA(name="msa", sequences=[s1, s2, s3]) >>> msa.select(sequences=[0, 2]).names ('seq1', 'seq3') >>> tuple(msa.select(columns=range(1,4)).alignment) ('TGC', 'TCC', 'TGA')
Added in version 0.11.1.
- textize()#
Convert the digital alignment to a text alignment.
- Returns:
TextMSA– A copy of the alignment in text-mode.
Added in version 0.3.0.
- write(fh, format)#
Write the multiple sequence alignement to a file handle.
- Parameters:
fh (
io.IOBase) – A Python file handle, opened in binary mode.format (
str) – The name of the multiple sequence alignment file format to use. Supported values are:stockholm,pfam(equivalent to thestockholmformat but with a single alignment line per sequence),a2m,psiblast,selex,afa(aligned FASTA),clustal,clustallike(equivalent toclustalwithout the strictCLUSTAL 2.1header line),phylip(interleaved),phylips(sequential).
Added in version 0.3.0.
- alignment#
A view of the aligned sequence data.
This property gives access to the aligned rows of the alignment in their encoded form.
Added in version 0.11.1.
- Type:
- indexed#
A mapping of names to sequences.
This property can be used to access the sequence of a multiple sequence alignment by name. An index is created the first time this property is accessed.
- Raises:
KeyError – When attempting to create an index for an alignment containing duplicate sequence names.
Example
>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATTA") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> msa.indexed['seq1'].sequence 'ATGC' >>> msa.indexed['seq3'] Traceback (most recent call last): ... KeyError: 'seq3'
Added in version 0.11.1.
- Type:
- names#
The name of each sequence in the alignment.
Every sequence in the alignment is required to have a name, so no member of the
tuplewill ever beNone.Example
>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATGC") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> msa.names ('seq1', 'seq2')
Added in version 0.4.8.
- posterior_probabilities#
The consensus posterior probabilities, if any.
Added in version 0.11.1.
- secondary_structure#
The consensus secondary structure, if any.
Added in version 0.11.1.
- sequence_weights#
The alignment sequence weights.
Sequence weights must sum to
len(self.sequences). By default, every sequence receives a weight of 1.Caution
Easel keeps an internal flag to mark whether a
MSAhas non-default sequence weights defined. To ensure this flag is properly set in the case of user-defined sequence weights, use the property setter rather than modifying the weights inplace:>>> s1 = TextSequence(name="seq1", sequence="ATGC") >>> s2 = TextSequence(name="seq2", sequence="ATCC") >>> msa = TextMSA(name="msa", sequences=[s1, s2]) >>> msa.sequence_weights # default weights, flag unset VectorD([1.0, 1.0]) >>> msa.sequence_weights = VectorD([0.5, 1.5]) >>> msa.sequence_weights # non-default weights, flag set VectorD([0.5, 1.5])
And use the
delkeyword to reset the sequence weights to their default values and unset the flag:>>> del msa.sequence_weights >>> msa.sequence_weights # default weights, flag unset again VectorD([1.0, 1.0])
Added in version 0.11.3.
- Type:
- sequences#
A view of the alignment sequences.
This property lets you access the individual sequences in the multiple sequence alignment as
DigitalSequenceinstances.See also
The documentation for the
TextMSA.sequencesproperty, which contains some additional information.Added in version 0.3.0.
- Type: