SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
|
Learning Objective:
In this tutorial, you will learn how to compute pairwise sequence alignments with SeqAn. This tutorial is a walkthrough with links to the API documentation and is also meant as a source for copy-and-paste code.
Difficulty | Intermediate |
---|---|
Duration | 60-90 min |
Prerequisite tutorials | Quick Setup (using CMake) Alphabets in SeqAn Ranges |
Recommended reading |
Aligning biological sequences is a very prominent component in many bioinformatics applications and pipelines. Well known genomic applications that use pairwise sequence alignments are read mapping, genome assembly, variant detection, multiple sequence alignment as well as protein search.
The goal of the pairwise alignment is to obtain an optimal transcript that describes how two DNA sequences are related to each other by means of substitutions, insertions, or deletions. The computed transcript describes then the operations necessary to translate the one sequence into the other, as can be seen in the following picture.
The alignment problem is solved with a dynamic programming (DP) algorithm which runs in \( (\mathcal{O}(n^2))\) time and space. Besides the global alignment approach, many more variations of this DP based algorithm have been developed over time. SeqAn unified all of these approaches into a single DP core implementation which can be extended easily and thus, with all possible configurations, is a very versatile and powerful tool to compute many desired alignment variants.
Let us first have a look at an example of computing a global alignment in SeqAn.
In the above example, we want to compute the similarity of two seqan3::dna4 sequences using a global alignment. For this, we need to import the dna4 alphabet, the seqan3::hamming_scoring_scheme and the seqan3::align_pairwise in the beginning of our file. We also import the seqan3::debug_stream which allows us to print various types in a nice formatted manner.
At the beginning of the file, we are defining our two DNA sequences s1
and s2
. If you feel puzzled about what the _dna4
suffix does, we recommend reading the Alphabets in SeqAn and Ranges before. In line 16-17 we configure the alignment job with the most simplistic configuration possible. In this case, it is a global alignment with edit distance, i.e. mismatches, insertions and deletions are scored with -1
and matches are scored with 0
. Later in this tutorial we will give a more detailed description of the configuration and how it can be used. The minimum requirement for computing a pairwise alignment is to specify the alignment method (seqan3::align_cfg::method_local or seqan3::align_cfg::method_global) and the seqan3::align_cfg::scoring_scheme configuration elements. The first one selects the internal algorithm and the second one provides the scoring scheme that should be used to score a pair of sequence characters.
Now we are going to call seqan3::align_pairwise. This interface requires two arguments: a tuple or a range of tuples with exactly two elements and the configuration object. Independent of the number of pairwise alignment jobs submitted to the algorithm it always returns a range over seqan3::alignment_result. Later we will see how we can use a continuation interface that calls a user-defined function rather than iterating over the results sequentially. Finally, we output the score for the computed alignment.
pairwise_align
does nothing as it returns a range that is evaluated in a lazy manner. Only when calling begin or incrementing the iterator over the range the alignment computation is invoked.
First, we create the vector of seqan3::dna4 sequences. We keep the configuration as is and then modify the initial code to a range-based for loop looping over the alignment results. Since the seqan3::alignment_result is a class template and the template parameters are determined during the configuration step we use auto as the result type. The current result is cached inside of the lazy range and we capture the result as const &
in order to not tamper with the result values.
Congratulations, you have computed your first pairwise alignment with SeqAn! As you can see, the interface is really simple, yet the configuration object makes it extremely flexible to conduct various different alignment calculations. In the following chapter, you will learn more about the various configuration possibilities.
The configuration object is the core of the alignment interface. It allows to easily configure the alignment algorithm without changing the interface of the pairwise alignment. It uses a seqan3::configuration object to chain different configuration elements together using the logical or-operator ('|'-operator). You can find an overview over the available configurations here. The configurations for the alignment module are available in:
The configuration elements are all classes that wrap the actual information necessary for the configuration of the alignment algorithm. Depending on the configuration specification certain features of the algorithm are enabled or disabled. Moreover, during the initialisation of the algorithm, the best implementation is chosen based on the given configurations. To avoid possible ambiguities with the configurations of other algorithms, the configuration elements for the alignment are defined in the special namespace seqan3::align_cfg.
The most straightforward algorithm is the global alignment which can be configured using the seqan3::align_cfg::method_global.
The global alignment can be further refined by initialising the seqan3::align_cfg::method_global configuration element with the free end-gap specifiers. They specify whether gaps at the end of the sequences are penalised. In SeqAn you can configure this behaviour for every end, namely for leading and trailing gaps of the first and second sequence. seqan3::align_cfg::method_global is constructed with 4 free end-gap specifiers (one for every end):
The following code snippet demonstrates the different use cases:
The order of arguments is fixed and must always be as shown in the example.
Adapt the code from Assignment 1 to compute the semi-global alignment for the case where both ends of the first sequence can be aligned to gaps without penalising them. Note that in such a semi-global alignment, the first sequence would be aligned as an infix of the second sequence.
To accomplish our goal we initialise the method_global
option with the free end specifiers for sequence 1 set to true
, and those for sequence 2 with false
.
A scoring scheme can be queried to get the score for substituting two alphabet values using the score
member function. Currently, SeqAn supports three scoring schemes. Besides seqan3::hamming_scoring_scheme, there is the seqan3::nucleotide_scoring_scheme used for aligning nucleotide sequences and seqan3::aminoacid_scoring_scheme used for aligning amino acid sequences. You can import them with the following includes:
In many biological applications, using the seqan3::hamming_scoring_scheme (Hamming Distance) is not enough as it entirely ignores the biological background of sequence alignments. If required, you can use the nucleotide scoring scheme when aligning nucleotides and the the amino acid scoring scheme when working with aminoacids. These scoring schemes allow a finer control of scoring two aligned letters. For example, you can set different match and mismatch scores for different pairs of aligned letters. By default, the scoring schemes are initialised by setting a simple scheme consisting of seqan3::match_score and seqan3::mismatch_score. But it is also possible to provide a custom matrix. The amino acid scoring scheme can additionally be initialised with a predefined substitution matrix that can be accessed via the seqan3::aminoacid_similarity_matrix enumeration class.
Similarly to the scoring scheme, you can use the seqan3::align_cfg::gap_cost_affine to customise the gap penalties used for the alignment computation. The default initialised seqan3::align_cfg::gap_cost_affine sets the score for a gap to -1
and for a gap opening to 0
. Note that the gap open score is added to the gap score when a gap is opened within the alignment computation. Therefore, setting the gap open score to 0
disables affine gaps altogether. You can pass a seqan3::align_cfg::extension_score and a seqan3::align_cfg::open_score object to initialise the scheme with custom gap penalties. The penalties can be changed later by using the respective member variables extension_score
and open_score
.
To configure the scoring scheme and the gap scheme for the alignment algorithm you need to use the seqan3::align_cfg::scoring_scheme and the seqan3::align_cfg::gap_cost_affine configurations. The seqan3::align_cfg::scoring_scheme is mandatory - similarly to the alignment method configuration. It would be wrong to assume what the default scoring scheme should be. If you do not provide these configurations, the compilation will fail with a corresponding error message. Not providing the gap scheme is ok. In this case, the default initialised gap scheme will be used for the alignment computation.
-2
and gap open costs of -9
. Use the BLOSUM62 similarity matrix.Only a few parts of our algorithm need to be adapted. First, we use an amino acid scoring scheme and initialise it with the respective similarity matrix. Second, we initialise the gap scheme to represent the affine gap model as given in the assignment. Et voilà, we have computed a pairwise alignment over aminoacid sequences.
So far, we have only used the score. However, in many situations the final alignment is required, e.g. when mapping reads and the user wishes to write the alignment to the final SAM/BAM file. In SeqAn you can simply configure what is going to be computed by the alignment algorithm using the different output configurations.
Accordingly, the alignment algorithm is configured to use the best implementation to obtain the desired result. The following table shows the different outcomes that can be configured:
Output option | Available result |
---|---|
seqan3::align_cfg::output_score | alignment score |
seqan3::align_cfg::output_end_position | end positions of the aligned sequences |
seqan3::align_cfg::output_begin_position | begin positions of the aligned sequences |
seqan3::align_cfg::output_alignment | alignment of the two sequences |
seqan3::align_cfg::output_sequence1_id | id of the first sequence |
seqan3::align_cfg::output_sequence2_id | id of the second sequence |
The final result is returned as a seqan3::alignment_result object. This object offers special member functions to access the stored values. If you try to access a value, e.g. the alignment, although you didn't specify seqan3::align_cfg::output_alignment in the output configuration, a static assertion will be triggered during compilation.
-4
and a simple scoring scheme with mismatch -2
and match 4
.
In many situations it is not necessary to compute the entire alignment matrix but only a part of it. This has positive impacts on the performance. To limit the computation space the alignment matrix can be bounded by a band. Thus, only the alignment is computed that fits in this band. Notably, this alignment does not need to be the optimal alignment. However, in many cases, we can give a rough bound on how similar the sequences will be and therefor use the banded alignment. To do so, you can configure the alignment using the seqan3::align_cfg::band_fixed_size option. This configuration element will be initialised with a seqan3::align_cfg::lower_diagonal and seqan3::align_cfg::upper_diagonal parameter.
-3
and upper diagonal set to 8
. How does the result change?
A special form of the pairwise sequence alignment is the edit distance. This distance metric counts the number of edits necessary to transform one sequence into the other. The cost model for the edit distance is fixed. In particular, the match score is 0
and the scores for a mismatch and a gap is -1
. Due to the special metric, a fast bitvector implementation can be used to compute the edit distance. This happens in SeqAn automatically if the respective configurations are used. To do so, you need to explicitly use the seqan3::hamming_scoring_scheme as well as the default gap scheme initialised with -1
for a gap and 0
for a gap open score while computing a seqan3::global_alignment. To make the configuration easier, we added a shortcut called seqan3::align_cfg::edit_scheme.
The edit_scheme
still has to be combined with an alignment method. When combining it with the seqan3::align_cfg::method_global configuration element, the edit distance algorithm can be further refined with free end-gaps (see section Global and semi-global alignment
).
false
)true
for the first and to false
for the second sequence) Using any other free end-gap configuration will disable the edit distance algorithm, i.e. the fast bitvector algorithm, and will fall back to the standard pairwise alignment.The edit distance can be further refined using the seqan3::align_cfg::min_score configuration to fix an edit score (a limit of the allowed number of edits). If the respective alignment could not find a solution within the given error bound, the resulting score is infinity (corresponds to std::numeric_limits::max). Also the alignment and the begin and end positions of the alignment can be computed using a combination of the seqan3::align_cfg::output_alignment, seqan3::align_cfg::output_begin_position and seqan3::align_cfg::output_end_position options.
Compute all pairwise alignments from the assignment 1 (only the scores). Only allow at most 7 errors and filter all alignments with 6 or fewer errors.
Chaining the configurations to build an individual alignment algorithm is a strong advantage of this design. However, some combinations would result in an invalid alignment configuration. To explicitly prevent this we added some security details. First, if a combination is invalid (for example by providing the same configuration more than once) a static assert will inform you about the invalid combination. Here you can find a table depicting the valid configurations. Further, if the seqan3::align_pairwise is called, it checks if the input data can be used with the given configuration. For example, a static assertion is emitted if the alphabet types of the sequences together with the provided scoring scheme do not model the concept seqan3::scoring_scheme_for. Other possible errors are invalid band settings where the initialised band does not intersect with the actual alignment matrix (the lower diagonal starts beyond the end of the first sequence).