Alignments#

class pyhmmer.easel.MSA#

An abstract alignment of multiple sequences.

Hint

Use len(msa) to get the number of columns in the alignment, and len(msa.sequences) to get the number of sequences (i.e. the number of rows).

checksum()#

Calculate a 32-bit checksum for the multiple sequence alignment.

compute_weights(method='pb', max_identity=0.62)#

Compute sequence weights from the alignment data.

This method provides the same functionality as the esl-weight mini-app. Three different algorithms are supported in Easel:

Position-based

The position-based weighting algorithm described in Henikoff & Henikoff (1994). It is defined for ungapped alignments, It does not give rules for dealing with gaps, nor for degenerate residue symbols. The rule here is to ignore these, and moreover (in digital mode alignments, in order to deal with deep gappy alignments) to only consider alignment columns that are “consensus” (defined below). This means that longer sequences initially get more weight; hence we do a “double normalization” in which the weights are first divided by sequence length in canonical residues (to get the average weight per residue, to compensate for that effect), then normalized to sum to \(N\).

The algorithm requires \(O(NL)\) time and \(O(LK)\) memory, for an alignment of \(L\) columns, \(N\) sequences, and alphabet size \(K\).

Gerstein/Sonnhammer/Chothia

An implementation of Gerstein et al. (1994).

The algorithm if \(O(N^2)\) memory (it requires a pairwise distance matrix) and \(O(N^3 + LN^2)\) time (\(N^3\) for a UPGMA tree building step, \(LN^2\) for distance matrix construction) for an alignment of \(N\) sequences and \(L\) columns. In the Easel implementation, the actual memory requirement is dominated by two full \(N \times N\) distances matrices. To keep the calculation under memory limits, avoid processing large alignments with this algorithm.

BLOSUM

The BLOSUM algorithm described in Henikoff & Henikoff (1992). It performs a single-linkage clustering by fractional id, defines clusters such that no two clusters have a pairwise link \(\geq\) max_identity, and assigns weights of \(\frac{1}{M_i}\) to each of the \(M_i\) sequences in each cluster $i$. The max_identity is a fractional pairwise identity in the range \(0..1\).

Parameters:
  • method (str) – The method to use for computing sequence weights. One of pb, gsc or blosum.

  • max_identity (float) – The identity threshold for clustering with the BLOSUM method.

Raises:

ValueError – When a pairwise identity calculation fails because of corrupted sequence data.

Returns:

VectorD – The newly computed weights. Note that this is a reference to the sequence_weights property, and the data may be mutated by subsequent calls to compute_weights. Use VectorD.copy to keep a hard copy if needed, e.g. for comparing between methods.

Note

The MSA may be in either digital or text mode. Digital mode is preferred so that the pairwise identity calculations deal with degenerate residue symbols properly. See also DigitalMSA.compute_weights which provides additional options controling how to determine the consensus columns in pb mode.

References

  • Henikoff, Steven, and Joria G. Henikoff. “Amino acid substitution matrices from protein blocks.” Proceedings of the National Academy of Sciences of the United States of America vol. 89,22 (1992): 10915-9. doi:10.1073/pnas.89.22.10915 PMID:1438297.

  • Henikoff, SSteven, and Joria G. Henikoff. “Position-based sequence weights.” Journal of molecular biology vol. 243,4 (1994): 574-8. doi:10.1016/0022-2836(94)90032-9.

  • Gerstein, Mark, Erik L.L. Sonnhammer and Cyrus Chothia. “Volume changes in protein evolution.” Journal of molecular biology vol. 236,4 (1994): 1067-78. doi:10.1016/0022-2836(94)90012-4. PMID:8120887.

Added in version 0.11.3.

mark_fragments(threshold)#

Mark fragmented sequences in the alignment.

This method calculates the fractional “span” of each aligned sequence (the aligned length from its first to last residue) divided by the total number aligned columns.

Parameters:

threshold (float) – The span threshold for the fragmented sequences. Sequences with a span under this will be marked as fragments.

Returns:

Bitfield – A bitfield object of size len(self.sequences) indicating which sequences of the alignment are fragments.

Note

In a TextMSA, any alphanumeric character is considered to be a residue, and any non-alphanumeric char is considered to be a gap.

Example

>>> s1 = TextSequence(name="seq1", sequence="--ATGC---")
>>> s2 = TextSequence(name="seq2", sequence="TTATCCG-T")
>>> s3 = TextSequence(name="seq3", sequence="TT-TCCGAT")
>>> msa = TextMSA(name="msa", sequences=[s1, s2, s3])
>>> msa.mark_fragments(0.5)
Bitfield([True, False, False])

Added in version 0.11.3.

select(sequences=None, columns=None)#

Select and copy a subset of the multiple sequence alignment.

Parameters:
  • sequences (iterable of int, or None) – The indices of sequences to retain in the alignment subset. If None given, retain all sequences.

  • columns (iterable of int, or None) – The indices of columns to retain in the alignment subset. If None given, retain all columns.

Raises:

IndexError – When given indices that are out of bounds for the sequences or columns.

Example

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATCC")
>>> s3 = TextSequence(name="seq3", sequence="ATGA")
>>> msa = TextMSA(name="msa", sequences=[s1, s2, s3])
>>> msa.select(sequences=[0, 2]).names
('seq1', 'seq3')
>>> tuple(msa.select(columns=range(1,4)).alignment)
('TGC', 'TCC', 'TGA')

Added in version 0.11.1.

write(fh, format)#

Write the multiple sequence alignement to a file handle.

Parameters:
  • fh (io.IOBase) – A Python file handle, opened in binary mode.

  • format (str) – The name of the multiple sequence alignment file format to use. Supported values are: stockholm, pfam (equivalent to the stockholm format but with a single alignment line per sequence), a2m, psiblast, selex, afa (aligned FASTA), clustal, clustallike (equivalent to clustal without the strict CLUSTAL 2.1 header line), phylip (interleaved), phylips (sequential).

Added in version 0.3.0.

accession#

The accession of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

author#

The author of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

description#

The description of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

indexed#

A mapping of names to sequences.

This property can be used to access the sequence of a multiple sequence alignment by name. An index is created the first time this property is accessed.

Raises:

KeyError – When attempting to create an index for an alignment containing duplicate sequence names.

Example

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATTA")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> msa.indexed['seq1'].sequence
'ATGC'
>>> msa.indexed['seq3']
Traceback (most recent call last):
...
KeyError: 'seq3'

Added in version 0.11.1.

Type:

Mapping

model_mask#

The model mask (#=GC MM), if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

bytes or None

name#

The name of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

names#

The name of each sequence in the alignment.

Every sequence in the alignment is required to have a name, so no member of the tuple will ever be None.

Example

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATGC")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> msa.names
('seq1', 'seq2')

Added in version 0.4.8.

Type:

tuple of str

posterior_probabilities#

The consensus posterior probabilities, if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

reference#

The reference annotation (#=GC RF), if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

secondary_structure#

The consensus secondary structure, if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

sequence_weights#

The alignment sequence weights.

Sequence weights must sum to len(self.sequences). By default, every sequence receives a weight of 1.

Caution

Easel keeps an internal flag to mark whether a MSA has non-default sequence weights defined. To ensure this flag is properly set in the case of user-defined sequence weights, use the property setter rather than modifying the weights inplace:

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATCC")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> msa.sequence_weights  # default weights, flag unset
VectorD([1.0, 1.0])
>>> msa.sequence_weights = VectorD([0.5, 1.5])
>>> msa.sequence_weights  # non-default weights, flag set
VectorD([0.5, 1.5])

And use the del keyword to reset the sequence weights to their default values and unset the flag:

>>> del msa.sequence_weights
>>> msa.sequence_weights  # default weights, flag unset again
VectorD([1.0, 1.0])

Added in version 0.11.3.

Type:

VectorD

surface_accessibility#

The consensus surface accessibility, if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

class pyhmmer.easel.TextMSA(MSA)#

A multiple sequence alignement stored in text mode.

__init__(*, name=None, description=None, accession=None, sequences=None, author=None)#

Create a new text-mode alignment with the given sequences.

Keyword Arguments:
  • name (str or bytes, optional) – The name of the alignment, if any.

  • description (str or bytes, optional) – The description of the alignment, if any.

  • accession (str or bytes, optional) – The accession of the alignment, if any.

  • sequences (collection of TextSequence) – The sequences to store in the multiple sequence alignment. All sequences must have the same length. They also need to have distinct names.

  • author (str or bytes, optional) – The author of the alignment, often used to record the aligner it was created with.

Raises:

Example

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATGC")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> len(msa)
4

Changed in version 0.3.0: Allow creating an alignment from an iterable of TextSequence.

Deprecated since version 0.11.1: Passing positional arguments to constructor.

checksum()#

Calculate a 32-bit checksum for the multiple sequence alignment.

compute_weights(method='pb', max_identity=0.62)#

Compute sequence weights from the alignment data.

This method provides the same functionality as the esl-weight mini-app. Three different algorithms are supported in Easel:

Position-based

The position-based weighting algorithm described in Henikoff & Henikoff (1994). It is defined for ungapped alignments, It does not give rules for dealing with gaps, nor for degenerate residue symbols. The rule here is to ignore these, and moreover (in digital mode alignments, in order to deal with deep gappy alignments) to only consider alignment columns that are “consensus” (defined below). This means that longer sequences initially get more weight; hence we do a “double normalization” in which the weights are first divided by sequence length in canonical residues (to get the average weight per residue, to compensate for that effect), then normalized to sum to \(N\).

The algorithm requires \(O(NL)\) time and \(O(LK)\) memory, for an alignment of \(L\) columns, \(N\) sequences, and alphabet size \(K\).

Gerstein/Sonnhammer/Chothia

An implementation of Gerstein et al. (1994).

The algorithm if \(O(N^2)\) memory (it requires a pairwise distance matrix) and \(O(N^3 + LN^2)\) time (\(N^3\) for a UPGMA tree building step, \(LN^2\) for distance matrix construction) for an alignment of \(N\) sequences and \(L\) columns. In the Easel implementation, the actual memory requirement is dominated by two full \(N \times N\) distances matrices. To keep the calculation under memory limits, avoid processing large alignments with this algorithm.

BLOSUM

The BLOSUM algorithm described in Henikoff & Henikoff (1992). It performs a single-linkage clustering by fractional id, defines clusters such that no two clusters have a pairwise link \(\geq\) max_identity, and assigns weights of \(\frac{1}{M_i}\) to each of the \(M_i\) sequences in each cluster $i$. The max_identity is a fractional pairwise identity in the range \(0..1\).

Parameters:
  • method (str) – The method to use for computing sequence weights. One of pb, gsc or blosum.

  • max_identity (float) – The identity threshold for clustering with the BLOSUM method.

Raises:

ValueError – When a pairwise identity calculation fails because of corrupted sequence data.

Returns:

VectorD – The newly computed weights. Note that this is a reference to the sequence_weights property, and the data may be mutated by subsequent calls to compute_weights. Use VectorD.copy to keep a hard copy if needed, e.g. for comparing between methods.

Note

The MSA may be in either digital or text mode. Digital mode is preferred so that the pairwise identity calculations deal with degenerate residue symbols properly. See also DigitalMSA.compute_weights which provides additional options controling how to determine the consensus columns in pb mode.

References

  • Henikoff, Steven, and Joria G. Henikoff. “Amino acid substitution matrices from protein blocks.” Proceedings of the National Academy of Sciences of the United States of America vol. 89,22 (1992): 10915-9. doi:10.1073/pnas.89.22.10915 PMID:1438297.

  • Henikoff, SSteven, and Joria G. Henikoff. “Position-based sequence weights.” Journal of molecular biology vol. 243,4 (1994): 574-8. doi:10.1016/0022-2836(94)90032-9.

  • Gerstein, Mark, Erik L.L. Sonnhammer and Cyrus Chothia. “Volume changes in protein evolution.” Journal of molecular biology vol. 236,4 (1994): 1067-78. doi:10.1016/0022-2836(94)90012-4. PMID:8120887.

Added in version 0.11.3.

copy()#

Duplicate the text sequence alignment, and return the copy.

digitize(alphabet)#

Convert the text alignment to a digital alignment using alphabet.

Returns:

DigitalMSA – An alignment in digital mode containing the same sequences digitized with alphabet.

Raises:

ValueError – When the text sequence contains invalid characters that cannot be converted according to alphabet.symbols.

mark_fragments(threshold)#

Mark fragmented sequences in the alignment.

This method calculates the fractional “span” of each aligned sequence (the aligned length from its first to last residue) divided by the total number aligned columns.

Parameters:

threshold (float) – The span threshold for the fragmented sequences. Sequences with a span under this will be marked as fragments.

Returns:

Bitfield – A bitfield object of size len(self.sequences) indicating which sequences of the alignment are fragments.

Note

In a TextMSA, any alphanumeric character is considered to be a residue, and any non-alphanumeric char is considered to be a gap.

Example

>>> s1 = TextSequence(name="seq1", sequence="--ATGC---")
>>> s2 = TextSequence(name="seq2", sequence="TTATCCG-T")
>>> s3 = TextSequence(name="seq3", sequence="TT-TCCGAT")
>>> msa = TextMSA(name="msa", sequences=[s1, s2, s3])
>>> msa.mark_fragments(0.5)
Bitfield([True, False, False])

Added in version 0.11.3.

select(sequences=None, columns=None)#

Select and copy a subset of the multiple sequence alignment.

Parameters:
  • sequences (iterable of int, or None) – The indices of sequences to retain in the alignment subset. If None given, retain all sequences.

  • columns (iterable of int, or None) – The indices of columns to retain in the alignment subset. If None given, retain all columns.

Raises:

IndexError – When given indices that are out of bounds for the sequences or columns.

Example

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATCC")
>>> s3 = TextSequence(name="seq3", sequence="ATGA")
>>> msa = TextMSA(name="msa", sequences=[s1, s2, s3])
>>> msa.select(sequences=[0, 2]).names
('seq1', 'seq3')
>>> tuple(msa.select(columns=range(1,4)).alignment)
('TGC', 'TCC', 'TGA')

Added in version 0.11.1.

write(fh, format)#

Write the multiple sequence alignement to a file handle.

Parameters:
  • fh (io.IOBase) – A Python file handle, opened in binary mode.

  • format (str) – The name of the multiple sequence alignment file format to use. Supported values are: stockholm, pfam (equivalent to the stockholm format but with a single alignment line per sequence), a2m, psiblast, selex, afa (aligned FASTA), clustal, clustallike (equivalent to clustal without the strict CLUSTAL 2.1 header line), phylip (interleaved), phylips (sequential).

Added in version 0.3.0.

accession#

The accession of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

alignment#

A view of the aligned sequence data.

This property gives access to the aligned sequences, including gap characters, so that they can be displayed or processed column by column.

Examples

Use TextMSA.alignment to display an alignment in text format:

>>> for name, aligned in zip(luxc.names, luxc.alignment):
...     print(name, " ", aligned[:40], "...")
Q9KV99.1   LANQPLEAILGLINEARKSWSST------------PELDP ...
Q2WLE3.1   IYSYPSEAMIEIINEYSKILCSD------------RKFLS ...
Q97GS8.1   VHDIKTEETIDLLDRCAKLWLDDNYSKK--HIETLAQITN ...
Q3WCI9.1   LLNVPLKEIIDFLVETGERIRDPRNTFMQDCIDRMAGTHV ...
P08639.1   LNDLNINNIINFLYTTGQRWKSEEYSRRRAYIRSLITYLG ...
...

Use the splat operator (*) in combination with the zip builtin to iterate over the columns of an alignment:

>>> for idx, col in enumerate(zip(*luxc.alignment)):
...     print(idx+1, col)
1 ('L', 'I', 'V', 'L', 'L', ...)
2 ('A', 'Y', 'H', 'L', 'N', ...)
...

Added in version 0.4.8.

Changed in version 0.11.1: Change the return type to a lazy collections.abc.Sequence.

Type:

collections.abc.Sequence

author#

The author of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

description#

The description of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

indexed#

A mapping of names to sequences.

This property can be used to access the sequence of a multiple sequence alignment by name. An index is created the first time this property is accessed.

Raises:

KeyError – When attempting to create an index for an alignment containing duplicate sequence names.

Example

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATTA")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> msa.indexed['seq1'].sequence
'ATGC'
>>> msa.indexed['seq3']
Traceback (most recent call last):
...
KeyError: 'seq3'

Added in version 0.11.1.

Type:

Mapping

model_mask#

The model mask (#=GC MM), if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

bytes or None

name#

The name of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

names#

The name of each sequence in the alignment.

Every sequence in the alignment is required to have a name, so no member of the tuple will ever be None.

Example

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATGC")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> msa.names
('seq1', 'seq2')

Added in version 0.4.8.

Type:

tuple of str

posterior_probabilities#

The consensus posterior probabilities, if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

reference#

The reference annotation (#=GC RF), if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

secondary_structure#

The consensus secondary structure, if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

sequence_weights#

The alignment sequence weights.

Sequence weights must sum to len(self.sequences). By default, every sequence receives a weight of 1.

Caution

Easel keeps an internal flag to mark whether a MSA has non-default sequence weights defined. To ensure this flag is properly set in the case of user-defined sequence weights, use the property setter rather than modifying the weights inplace:

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATCC")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> msa.sequence_weights  # default weights, flag unset
VectorD([1.0, 1.0])
>>> msa.sequence_weights = VectorD([0.5, 1.5])
>>> msa.sequence_weights  # non-default weights, flag set
VectorD([0.5, 1.5])

And use the del keyword to reset the sequence weights to their default values and unset the flag:

>>> del msa.sequence_weights
>>> msa.sequence_weights  # default weights, flag unset again
VectorD([1.0, 1.0])

Added in version 0.11.3.

Type:

VectorD

sequences#

A view of the alignment sequences.

This property lets you access the individual sequences in the multiple sequence alignment as TextSequence instances.

Example

Query the number of sequences in the alignment with len, or access individual members via indexing notation:

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATGC")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> len(msa.sequences)
2
>>> msa.sequences[0].name
'seq1'

Caution

Sequences in the list are copies, so editing their attributes will have no effect on the alignment:

>>> msa.sequences[0].name
'seq1'
>>> msa.sequences[0].name = "seq1bis"
>>> msa.sequences[0].name
'seq1'

Support for this feature may be added in a future version, but can be circumvented for now by forcingly setting the updated version of the object:

>>> seq = msa.sequences[0]
>>> seq.name = "seq1bis"
>>> msa.sequences[0] = seq
>>> msa.sequences[0].name
'seq1bis'

Added in version 0.3.0.

Type:

collections.abc.Sequence

surface_accessibility#

The consensus surface accessibility, if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

class pyhmmer.easel.DigitalMSA(MSA)#

A multiple sequence alignment stored in digital mode.

alphabet#

The biological alphabet used to encode this sequence alignment to digits.

Type:

Alphabet

sample(alphabet, max_sequences, max_length, randomness=None)#

Sample a sequence of length at most L at random.

Parameters:
  • alphabet (Alphabet) – The alphabet of the multiple sequence alignment.

  • max_sequences (int) – The maximum number of sequences to sample for the alignment (the actual sequence number is sampled).

  • max_length (int) – The maximum length of the alignment to generate (the actual sequence length is sampled).

  • randomness (Randomness, int or None) – The random number generator to use for sampling, or a seed to initialize a generator. If None or 0 given, create a new random number generator with a random seed.

Returns:

DigitalMSA – A new digital multiple sequence alignment generated at random.

Hint

This constructor is only useful for testing and should not be used to generate random sequences to e.g. compute a background distribution for a statistical method, since this function samples alphabet residues at random irrespective of prior frequences.

Added in version 0.11.1.

__init__(alphabet, *, name=None, description=None, accession=None, sequences=None, author=None)#

Create a new digital-mode alignment with the given sequences.

Parameters:

alphabet (Alphabet) – The alphabet of the alignmed sequences.

Keyword Arguments:
  • name (bytes, optional) – The name of the alignment, if any.

  • description (bytes, optional) – The description of the alignment, if any.

  • accession (bytes, optional) – The accession of the alignment, if any.

  • sequences (iterable of DigitalSequence) – The sequences to store in the multiple sequence alignment. All sequences must have the same length and alphabet. They also need to have distinct names set.

  • author (bytes, optional) – The author of the alignment, often used to record the aligner it was created with.

Changed in version 0.3.0: Allow creating an alignment from an iterable of DigitalSequence.

Deprecated since version 0.11.1: Passing positional arguments other than alphabet.

checksum()#

Calculate a 32-bit checksum for the multiple sequence alignment.

compute_weights(method='pb', max_identity=0.62, fragment_threshold=0.5, consensus_fraction=0.5, ignore_rf=False, sample=True, sample_threshold=50000, sample_count=10000, max_fragments=5000, seed=42, preference='conscover')#

Compute sequence weights from the alignment data.

This method provides the same functionality as the esl-weight mini-app. It supports additional configuration parameters for consensus creation in position-based weighting. See MSA.compute_weights documentation for more details.

Parameters:
  • method (str) – The method to use for computing sequence weights. One of pb, gsc or blosum.

  • max_identity (float) – The identity threshold for clustering with the BLOSUM method.

Keyword Arguments:
  • fragment_threshold (float) – The threshold for determining which sequences of the alignment are fragments. An sequence spanning columns \(i\) to \(j\) of an alignment of width \(W\) will be flagged as a fragment if \(\frac{j - i}{ W } < \text{fragment_threshold}\),

  • consensus_fraction (float) – The parameter for determining with columns of the alignment are consensus columns. A column containing \(n\) symbols and \(m\) gaps will be marked a consensus column if \(\frac{n}{n + m} \ge \text{consensus_fraction}\).

  • ignore_rf (bool) – Set to True to ignore the RF line of the alignment (if present) and to force building the consensus.

  • sample (bool) – Whether or not to enable consensus determination by subsampling for large alignments. Set to False to force using all sequences.

  • sample_threshold (int) – The minimum number of sequences the alignment must contain to use subsampling for consensus determination (when sample is True).

  • sample_count (int) – The number of sequences to use when determining consensus by random subsampling.

  • max_fragments (int) – The maximum number of allowed fragments in the sample used for determining consensus. If the sample contains more than max_fragments fragments, the consensus determination is done with all sequences instead.

  • seed (int) – The seed to use for initializing the random number generator (used when preference is random or when sample is True). If 0 or None is given, an arbitrary seed will be chosen using the system clock.

  • preference (str) – The strategy to use for selecting the representative sequence in case of duplicates. Supported strategies are conscover (the default), which prefers the sequence with an alignment span that covers more consensus columns; origorder to use the first sequence in the original alignment order; and random to select the sequence at random.

Raises:

ValueError – When a pairwise identity calculation fails because of corrupted sequence data.

Returns:

VectorD – The newly computed weights. Note that this is a reference to the sequence_weights property, and the data may be mutated by subsequent calls to compute_weights. Use VectorD.copy to keep a hard copy if needed, e.g. for comparing between methods.

Added in version 0.11.3.

copy()#

Duplicate the digital sequence alignment, and return the copy.

identity_filter(max_identity=0.8, fragment_threshold=0.5, consensus_fraction=0.5, ignore_rf=False, sample=True, sample_threshold=50000, sample_count=10000, max_fragments=5000, seed=42, preference='conscover')#

Filter the alignment sequences by percent identity.

This method provides the same functionality as the esl-weight mini-app with the -f flag enabled.

Parameters:

max_identity (float) – The maximum fractional identity allowed between two alignment sequences. Sequences with fractional identity \(\geq\) this number will be removed, with the remaining one selected according to the given preference.

Keyword Arguments:
  • fragment_threshold (float) – The threshold for determining which sequences of the alignment are fragments. An sequence spanning columns \(i\) to \(j\) of an alignment of width \(W\) will be flagged as a fragment if \(\frac{j - i}{ W } < \text{fragment_threshold}\),

  • consensus_fraction (float) – The parameter for determining with columns of the alignment are consensus columns. A column containing \(n\) symbols and \(m\) gaps will be marked a consensus column if \(\frac{n}{n + m} \ge \text{consensus_fraction}\).

  • ignore_rf (bool) – Set to True to ignore the RF line of the alignment (if present) and to force building the consensus.

  • sample (bool) – Whether or not to enable consensus determination by subsampling for large alignments. Set to False to force using all sequences.

  • sample_threshold (int) – The minimum number of sequences the alignment must contain to use subsampling for consensus determination (when sample is True).

  • sample_count (int) – The number of sequences to use when determining consensus by random subsampling.

  • max_fragments (int) – The maximum number of allowed fragments in the sample used for determining consensus. If the sample contains more than max_fragments fragments, the consensus determination is done with all sequences instead.

  • seed (int) – The seed to use for initializing the random number generator (used when preference is random or when sample is True). If 0 or None is given, an arbitrary seed will be chosen using the system clock.

  • preference (str) – The strategy to use for selecting the representative sequence in case of duplicates. Supported strategies are conscover (the default), which prefers the sequence with an alignment span that covers more consensus columns; origorder to use the first sequence in the original alignment order; and random to select the sequence at random.

Returns:

MSA – The multiple sequence alignments with duplicate sequence removed. Unparsed Sotckholm markup is not propagated.

mark_fragments(threshold)#

Mark fragmented sequences in the alignment.

This method calculates the fractional “span” of each aligned sequence (the aligned length from its first to last residue) divided by the total number aligned columns.

Parameters:

threshold (float) – The span threshold for the fragmented sequences. Sequences with a span under this will be marked as fragments.

Returns:

Bitfield – A bitfield object of size len(self.sequences) indicating which sequences of the alignment are fragments.

Note

In a TextMSA, any alphanumeric character is considered to be a residue, and any non-alphanumeric char is considered to be a gap.

Example

>>> s1 = TextSequence(name="seq1", sequence="--ATGC---")
>>> s2 = TextSequence(name="seq2", sequence="TTATCCG-T")
>>> s3 = TextSequence(name="seq3", sequence="TT-TCCGAT")
>>> msa = TextMSA(name="msa", sequences=[s1, s2, s3])
>>> msa.mark_fragments(0.5)
Bitfield([True, False, False])

Added in version 0.11.3.

reverse_complement(inplace=False)#

Build the reverse complement of the MSA.

In addition to reverse-complementing the sequence data, per-column and per-residue annotation also gets reversed or reverse complemented.

Parameters:

inplace (bool) – Whether or not to copy the sequence before computing its reverse complement. With False (the default), the method will return a copy of the sequence that has been reverse-complemented. With True, it will reverse-complement inplace and return None.

Raises:

ValueError – When the alphabet of the DigitalMSA does not have a complement mapping set (e.g., Alphabet.amino).

Caution

The copy made when inplace is False is an exact copy, so the name, description and accession of the copy will be the same. This could lead to duplicates if you’re not careful!

Added in version 0.11.3.

select(sequences=None, columns=None)#

Select and copy a subset of the multiple sequence alignment.

Parameters:
  • sequences (iterable of int, or None) – The indices of sequences to retain in the alignment subset. If None given, retain all sequences.

  • columns (iterable of int, or None) – The indices of columns to retain in the alignment subset. If None given, retain all columns.

Raises:

IndexError – When given indices that are out of bounds for the sequences or columns.

Example

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATCC")
>>> s3 = TextSequence(name="seq3", sequence="ATGA")
>>> msa = TextMSA(name="msa", sequences=[s1, s2, s3])
>>> msa.select(sequences=[0, 2]).names
('seq1', 'seq3')
>>> tuple(msa.select(columns=range(1,4)).alignment)
('TGC', 'TCC', 'TGA')

Added in version 0.11.1.

textize()#

Convert the digital alignment to a text alignment.

Returns:

TextMSA – A copy of the alignment in text-mode.

Added in version 0.3.0.

write(fh, format)#

Write the multiple sequence alignement to a file handle.

Parameters:
  • fh (io.IOBase) – A Python file handle, opened in binary mode.

  • format (str) – The name of the multiple sequence alignment file format to use. Supported values are: stockholm, pfam (equivalent to the stockholm format but with a single alignment line per sequence), a2m, psiblast, selex, afa (aligned FASTA), clustal, clustallike (equivalent to clustal without the strict CLUSTAL 2.1 header line), phylip (interleaved), phylips (sequential).

Added in version 0.3.0.

accession#

The accession of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

alignment#

A view of the aligned sequence data.

This property gives access to the aligned rows of the alignment in their encoded form.

Added in version 0.11.1.

Type:

collections.abc.Sequence

author#

The author of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

description#

The description of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

indexed#

A mapping of names to sequences.

This property can be used to access the sequence of a multiple sequence alignment by name. An index is created the first time this property is accessed.

Raises:

KeyError – When attempting to create an index for an alignment containing duplicate sequence names.

Example

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATTA")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> msa.indexed['seq1'].sequence
'ATGC'
>>> msa.indexed['seq3']
Traceback (most recent call last):
...
KeyError: 'seq3'

Added in version 0.11.1.

Type:

Mapping

model_mask#

The model mask (#=GC MM), if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

bytes or None

name#

The name of the alignment, if any.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

names#

The name of each sequence in the alignment.

Every sequence in the alignment is required to have a name, so no member of the tuple will ever be None.

Example

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATGC")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> msa.names
('seq1', 'seq2')

Added in version 0.4.8.

Type:

tuple of str

posterior_probabilities#

The consensus posterior probabilities, if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

reference#

The reference annotation (#=GC RF), if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

secondary_structure#

The consensus secondary structure, if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None

sequence_weights#

The alignment sequence weights.

Sequence weights must sum to len(self.sequences). By default, every sequence receives a weight of 1.

Caution

Easel keeps an internal flag to mark whether a MSA has non-default sequence weights defined. To ensure this flag is properly set in the case of user-defined sequence weights, use the property setter rather than modifying the weights inplace:

>>> s1 = TextSequence(name="seq1", sequence="ATGC")
>>> s2 = TextSequence(name="seq2", sequence="ATCC")
>>> msa = TextMSA(name="msa", sequences=[s1, s2])
>>> msa.sequence_weights  # default weights, flag unset
VectorD([1.0, 1.0])
>>> msa.sequence_weights = VectorD([0.5, 1.5])
>>> msa.sequence_weights  # non-default weights, flag set
VectorD([0.5, 1.5])

And use the del keyword to reset the sequence weights to their default values and unset the flag:

>>> del msa.sequence_weights
>>> msa.sequence_weights  # default weights, flag unset again
VectorD([1.0, 1.0])

Added in version 0.11.3.

Type:

VectorD

sequences#

A view of the alignment sequences.

This property lets you access the individual sequences in the multiple sequence alignment as DigitalSequence instances.

See also

The documentation for the TextMSA.sequences property, which contains some additional information.

Added in version 0.3.0.

Type:

collections.abc.Sequence

surface_accessibility#

The consensus surface accessibility, if any.

Added in version 0.11.1.

Changed in version 0.12.0: Property is now a str instead of bytes.

Type:

str or None