SeqAn3  3.0.2
The Modern C++ library for sequence analysis.
Pairwise Alignment

Learning Objective:

In this tutorial you will learn how to compute pairwise sequence alignments with SeqAn. This tutorial is a walkthrough with links to the API documentation and is also meant as a source for copy-and-paste code.

DifficultyIntermediate
Duration60-90 min
Prerequisite tutorialsQuick Setup (using CMake) Alphabets in SeqAn Ranges
Recommended reading

Introduction

Aligning biological sequences is a very prominent component in many bioinformatics applications and pipelines. Well known genomic applications that use pairwise sequence alignments are read mapping, genome assembly, variant detection, multiple sequence alignment as well as protein search.

The goal of the pairwise alignment is to obtain an optimal transcript that describes how two DNA sequences are related to each other by means of substitutions, insertions, or deletions. The computed transcript describes then the operations necessary to translate the one sequence into the other, as can be seen in the following picture.

The alignment problem is solved with a dynamic programming (DP) algorithm which runs in $ (\mathcal{O}(n^2))$ time and space. Besides the global alignment approach many more variations of this DP based algorithm have been developed over time. SeqAn unified all of these approaches into a single DP core implementation which can be extended easily and thus, with all possible configurations, is a very versatile and powerful tool to compute many desired alignment variants.

Computing pairwise alignments

Let us first have look at an example of computing a global alignment in SeqAn.

1 #include <utility>
2 
7 
8 using seqan3::operator""_dna4;
9 
10 int main()
11 {
12  seqan3::dna4_vector s1 = "ACGTGAACTGACT"_dna4;
13  seqan3::dna4_vector s2 = "ACGAAGACCGAT"_dna4;
14 
15  // Configure the alignment kernel.
16  auto config = seqan3::align_cfg::method_global{} |
18 
19  // Invoke the pairwise alignment which returns a lazy range over alignment results.
20  auto results = seqan3::align_pairwise(std::tie(s1, s2), config);
21  auto & res = *results.begin();
22  seqan3::debug_stream << "Score: " << res.score() << '\n';
23 }

In the above example we want to compute the similarity of two seqan3::dna4 sequences using a global alignment. For this we need to import the dna4 alphabet, the seqan3::nucleotide_scoring_scheme and the seqan3::align_pairwise in the beginning of our file. We also import the seqan3::debug_stream which allows us to print various types in a nice formatted manner.

In the beginning of the file we are defining our two DNA sequences s1 and s2. If you feel puzzled about what the _dna4 suffix does we recommend to read the Alphabets in SeqAn and Ranges before. In line 16-17 we configure the alignment job with the most simplistic configuration possible. In this case it is a global alignment with edit distance. Later in this tutorial we will give a more detailed description of the configuration and how it can be used. The minimum requirement for computing a pairwise alignment is to specify the alignment method (seqan3::align_cfg::method_local or seqan3::align_cfg::method_global) and the seqan3::align_cfg::scoring_scheme configuration elements. The first one selects the internal algorithm and the second one provides the scoring scheme that should be used to score a pair of sequence characters.

Now we are going to call seqan3::align_pairwise. This interface requires two arguments: a tuple or a range of tuples with exactly two elements and the configuration object. Independent of the number of pairwise alignment jobs submitted to the algorithm it always returns a range over seqan3::alignment_result. Later we will see how we can use a continuation interface which calls a user-defined function rather than iterating over the results sequentially. Finally, we output the score for the computed alignment.

Attention
Just calling pairwise_align does nothing as it returns a range which is evaluated in a lazy manner. Only when calling begin or incrementing the iterator over the range the alignment computation is invoked.

Assignment 1

Copy and paste the minimal working example from above into a cpp file in your working directory and compile and run it. Add the two sequences "ACGTGACTGACT" and "AGGTACGAGCGACACT" to the set and compute all pairwise sequence alignments of the four sequences and output the scores.

Hint You can use the seqan3::views::pairwise_combine to generate all pairwise combinations.
Solution

#include <utility>
#include <vector>
using seqan3::operator""_dna4;
int main()
{
std::vector vec{"ACGTGAACTGACT"_dna4,
"ACGAAGACCGAT"_dna4,
"ACGTGACTGACT"_dna4,
"AGGTACGAGCGACACT"_dna4};
// Configure the alignment kernel.
for (auto const & res : seqan3::align_pairwise(seqan3::views::pairwise_combine(vec), config))
seqan3::debug_stream << "Score: " << res.score() << '\n';
}

First we create the vector of seqan3::dna4 sequences. We keep the configuration as is and then modify the initial code to a range-based for loop looping over the alignment results. Since the seqan3::alignment_result is a class template and the template parameters are determined during the configuration step we use auto as the result type. The current result is cached inside of the lazy range and we capture the result as const & in order to not tamper with the result values.

Congratulations, you have computed your first pairwise alignment with SeqAn! As you can see, the interface is really simple, yet the configuration object makes it extremely flexible to conduct various different alignment calculations. In the following chapter you will learn more about the various configuration possibilities.

Alignment configurations

The configuration object is the core of the alignment interface. It allows to easily configure the alignment algorithm without changing the interface of the pairwise alignment. It uses a seqan3::configuration object to chain different configuration elements together using the logical or-operator ('|'-operator). You can find an overview over the available configurations here. The configurations for the alignment module are available in:

The configuration elements are all classes that wrap the actual information necessary for the configuration of the alignment algorithm. Depending on the configuration specification certain features of the algorithm are enabled or disabled. Moreover, during the initialisation of the algorithm the best implementation is chosen based on the given configurations. To avoid possible ambiguities with the configurations of other algorithms, the configuration elements for the alignment are defined in the special namespace seqan3::align_cfg.

Global and semi-global alignment

The most straightforward algorithm is the global alignment which can be configured using the seqan3::align_cfg::method_global.

Remarks
The method configuration must be given by the user as it strongly depends on the application context. It would be wrong for us to assume what the intended default behaviour should be.

The global alignment can be further refined by initialising the seqan3::align_cfg::method_global configuration element with the free end gap specifiers. They specify whether gaps at the end of the sequences are penalised. In SeqAn you can configure this behaviour for every end, namely for leading and trailing gaps of the first and second sequence. seqan3::align_cfg::method_global is constructed with 4 free end gap specifiers (one for every end):

The following code snippet demonstrates the different use cases:

// Example of a semi-global alignment where leading and trailing gaps in the
// second sequence are not penalised:

The order of arguments is fixed and must always be as shown in the example.

Assignment 2

Adapt the code from Assignment 1 to compute the semi-global alignment for the case where both ends of the first sequence can be aligned to gaps without penalising them. Note that in such a semi-global alignment, the first sequence would be aligned as an infix of the second sequence.

Solution

To accomplish our goal we initialise the method_global option with the free end specifiers for sequence 1 set to true, and those for sequence 2 with false.

Scoring schemes and gap schemes

A scoring scheme can be queried to get the score for substituting two alphabet values using the score member function. SeqAn currently supports currently two scoring schemes: seqan3::nucleotide_scoring_scheme and seqan3::aminoacid_scoring_scheme. You can import them with the following includes:

As the names suggests, you need to use the former when scoring nucleotides and the latter when working with aminoacids. You have already used the seqan3::nucleotide_scoring_scheme in the assignments before. The scoring scheme was default initialised which will result in using the Hamming Distance. The scoring schemes can also be configured by either setting a simple scheme consisting of seqan3::match_score and seqan3::mismatch_score or by providing a custom matrix. The amino acid scoring scheme can additionally be initialised with a predefined substitution matrix that can be accessed via the seqan3::aminoacid_similarity_matrix enumeration class.

using seqan3::operator""_dna4;
using seqan3::operator""_aa27;
// Define a simple scoring scheme with match and mismatch cost and get the score.
auto sc_nc = nc_scheme.score('A'_dna4, 'C'_dna4); // sc_nc == -5.
// Define a amino acid similarity matrix and get the score.
aa_scheme.set_similarity_matrix(seqan3::aminoacid_similarity_matrix::BLOSUM30);
auto sc_aa = aa_scheme.score('M'_aa27, 'K'_aa27); // sc_aa == 2.
Note
You can also provide your own scoring scheme implementation if it models seqan3::scoring_scheme.

Similarly to the scoring scheme, you can use the seqan3::align_cfg::gap_cost_affine to set the gap penalties used for the alignment computation. The default initialised seqan3::align_cfg::gap_cost_affine sets the score for a gap to -1 and for a gap opening to 0. Note that the gap open score is added to the gap score when a gap is opened within the alignment computation. Therefore setting the gap open score to 0 disables affine gaps. You can pass a seqan3::align_cfg::extension_score and a seqan3::align_cfg::open_score object to initialise the scheme with custom gap penalties. The penalties can be assessed changed later by using the respective member variables extension_score and open_score.

Attention
SeqAn's alignment algorithm computes the maximal similarity score, thus the match score must be set to a positive value and the score for mismatch and gaps must be negative in order to maximise over the matching letters.
// Define a gap scheme with custom gap scores.
int open_score = affine_scheme.open_score; // == -10
int extension_score = affine_scheme.extension_score; // == -1

To configure the scoring scheme and the gap scheme for the alignment algorithm you need to use the seqan3::align_cfg::scoring_scheme and the seqan3::align_cfg::gap configurations. The seqan3::align_cfg::scoring_scheme is mandatory - similarly to the alignment method configuration. It would be wrong to assume what the default scoring scheme should be. If you do not provide these configurations, the compilation will fail with a corresponding error message. Not providing the gap scheme is ok. In this case the default initialised gap scheme will be used for the alignment computation.

Assignment 3

Compute the alignment of the two amino acid sequences listed below using an affine gap scheme with gap costs of -2 and gap open costs of -9. Use the BLOSUM62 similarity matrix.

  • QFSEEILSDIYCWMLQCGQERAV
  • AFLPGWQEENKLSKIWMKDCGCLW
Solution
#include <seqan3/alignment/pairwise/all.hpp> // for seqan3::align_cfg and seqan3::align_pairwise
#include <seqan3/alignment/scoring/all.hpp> // for seqan3::aminoacid_scoring_scheme and
// seqan3::aminoacid_similarity_matrix
#include <seqan3/alphabet/aminoacid/aa27.hpp> // for seqan3::operator""_aa27
using seqan3::operator""_aa27;
int main()
{
auto seq1 = "QFSEEILSDIYCWMLQCGQERAV"_aa27;
auto seq2 = "AFLPGWQEENKLSKIWMKDCGCLW"_aa27;
// Configure the alignment kernel.
seqan3::aminoacid_scoring_scheme{seqan3::aminoacid_similarity_matrix::BLOSUM62}} |
for (auto const & res : seqan3::align_pairwise(std::tie(seq1, seq2), config))
seqan3::debug_stream << "Score: " << res.score() << '\n';
}

Only a few parts of our algorithm need to be adapted. First, we use an amino acid scoring scheme and initialise it with the respective similarity matrix. Second, we initialise the gap scheme to represent the affine gap model as given in the assignment. Et voilĂ , we have computed a pairwise alignment over aminoacid sequences.

Alignment result

So far we have only used the score, but obviously in many situations the final alignment is required, e.g. when mapping reads and the user wishes to write the alignment to the final SAM/BAM file. In SeqAn you can simply configure what is going to be computed by the alignment algorithm using the different output configurations.

// Configure the alignment to only compute the score.

Accordingly, the alignment algorithm is configured to use the best implementation to obtain the desired result. The following table shows the different outcomes that can be configured:

Output option Available result
seqan3::align_cfg::output_score alignment score
seqan3::align_cfg::output_end_position end positions of the aligned sequences
seqan3::align_cfg::output_begin_position begin positions of the aligned sequences
seqan3::align_cfg::output_alignment alignment of the two sequences
seqan3::align_cfg::output_sequence1_id id of the first sequence
seqan3::align_cfg::output_sequence2_id id of the second sequence

The final result is returned as a seqan3::alignment_result object. This object offers special member functions to access the stored values. If you try to access a value, e.g. the alignment, although you didn't specify seqan3::align_cfg::output_alignment in the result configuration, a static assertion will be triggered during compilation.

Note
If you don't specify any of the above mentioned output configurations then by default all options are enabled and will be computed. In order to potentially increase the performance of the alignment algorithm only enable those options that are needed for your use case.

Assignment 4

Compute the overlap alignment of the following two sequences. Use a linear gap scheme with a gap score of -4 and a simple scoring scheme with mismatch -2 and match 4.

  • TTACGTACGGACTAGCTACAACATTACGGACTAC
  • GGACGACATGACGTACGACTTTACGTACGACTAGC
Solution
#include <utility>
using seqan3::operator""_dna4;
int main()
{
auto seq1 = "TTACGTACGGACTAGCTACAACATTACGGACTAC"_dna4;
auto seq2 = "GGACGACATGACGTACGACTTTACGTACGACTAGC"_dna4;
// Configure the output:
auto output_config = seqan3::align_cfg::output_score{} |
// Configure the alignment kernel together with the previous output configuration.
output_config;
for (auto const & res : seqan3::align_pairwise(std::tie(seq1, seq2), config))
{
seqan3::debug_stream << "Score: " << res.score() << '\n';
seqan3::debug_stream << "Begin: (" << res.sequence1_begin_position() << "," << res.sequence2_begin_position()
<< ")\n";
seqan3::debug_stream << "End: (" << res.sequence1_end_position() << "," << res.sequence2_end_position()
<< ")\n";
seqan3::debug_stream << "Alignment: \n" << res.alignment() << '\n';
}
}

Banded alignment

In many situations it is not necessary to compute the entire alignment matrix but only a part of it. This has positive impacts on the performance. To limit the computation space the alignment matrix can be bounded by a band. Thus, only the alignment is computed that fits in this band. Note that this must not be the optimal alignment but in many cases we can give a rough bound on how similar the sequences will be and therefor use the banded alignment. To do so, you can configure the alignment using the seqan3::align_cfg::band_fixed_size option. This configuration element will be initialised with a seqan3::align_cfg::lower_diagonal and seqan3::align_cfg::upper_diagonal.

Assignment 5

Use the example from assignment 4 and compute it in a band with lower diagonal set to -3 and upper diagonal set to 8. How does the result change?

Solution

#include <utility>
using seqan3::operator""_dna4;
int main()
{
auto seq1 = "TTACGTACGGACTAGCTACAACATTACGGACTAC"_dna4;
auto seq2 = "GGACGACATGACGTACGACTTTACGTACGACTAGC"_dna4;
// Configure the output:
auto output_config = seqan3::align_cfg::output_score{} |
// Configure the alignment kernel.
output_config |
for (auto const & res : seqan3::align_pairwise(std::tie(seq1, seq2), config))
{
seqan3::debug_stream << "Score: " << res.score() << '\n';
seqan3::debug_stream << "Begin: (" << res.sequence1_begin_position() << "," << res.sequence2_begin_position()
<< ")\n";
seqan3::debug_stream << "End: (" << res.sequence1_end_position() << "," << res.sequence2_end_position()
<< ")\n";
seqan3::debug_stream << "Alignment: \n" << res.alignment() << '\n';
}
}

Edit distance

A special form of the pairwise sequence alignment is the edit distance. This distance metric counts the number of edits necessary to transform one sequence into the other. The cost model for the edit distance is fixed. In particular, the match score is 0 and the scores for a mismatch and a gap is -1. Due to the special metric a fast bitvector implementation can be used to compute the edit distance. This happens in SeqAn automatically if the respective configurations are used. To do so, you need a scoring scheme initialised with Manhattan distance (at the moment only seqan3::nucleotide_scoring_scheme supports this) and a gap scheme initialised with -1 for a gap and 0 for a gap open score and computing a seqan3::global_alignment. To make the configuration easier, we added a shortcut called seqan3::align_cfg::edit_scheme.

// Configure an edit distance alignment.

The edit_scheme still has to be combined with an alignment method. When combining it with the seqan3::align_cfg::method_global configuration element, the edit distance algorithm can be further refined with free end gaps (see section Global and semi-global alignment).

Attention
Only the following free end gap configurations are supported for the global alignment configuration with the edit scheme:
  • no free end gaps (all free end gap specifiers are set to false)
  • free end gaps for the first sequence (free end gaps are set to true for the first and to false for the second sequence) Using any other free end gap configuration will disable the edit distance and fall back to the standard pairwise alignment and will not use the fast bitvector algorithm.

Refine edit distance

The edit distance can be further refined using the seqan3::align_cfg::min_score configuration to fix an edit score (a limit of the allowed number of edits).. If the respective alignment could not find a solution within the given error bound, the resulting score is infinity (corresponds to std::numeric_limits::max). Also the alignment and the begin and end positions of the alignment can be computed using a combination of the align_cfg::output_alignment, align_cfg::output_begin_position and align_cfg::output_end_position options.

Assignment 6

Compute all pairwise alignments from the assignment 1 (only the scores). Only allow at most 7 errors and filter all alignments with 6 or less errors.

Hint You can use the std::views::filter to get only those alignments that fit the requirements.

Solution

#include <utility>
#include <vector>
using seqan3::operator""_dna4;
int main()
{
std::vector vec{"ACGTGACTGACT"_dna4,
"ACGAAGACCGAT"_dna4,
"ACGTGACTGACT"_dna4,
"AGGTACGAGCGACACT"_dna4};
// Configure the alignment kernel.
auto filter_v = std::views::filter([](auto && res) { return res.score() >= -6;});
for (auto const & res : seqan3::align_pairwise(seqan3::views::pairwise_combine(vec), config) | seqan3::views::persist | filter_v)
{
seqan3::debug_stream << "Score: " << res.score() << '\n';
}
}

Invalid configurations

Chaining the configurations to build an individual alignment algorithm is a strong advantage of this design. However, some combinations would result in an invalid alignment configuration. To explicitly prevent this we added some security details. First, if a combination is invalid (for example by providing the same configuration more than once) a static assert will inform you about the invalid combination. Here you can find a table depicting the valid configurations. Further, if the seqan3::align_pairwise is called, it checks if the input data can be used with the given configuration. For example, a static assertion is emitted if the alphabet types of the sequences together with the provided scoring scheme do not model seqan3::scoring_scheme. Other possible errors are invalid band settings where the initialised band does not intersect with the actual alignment matrix (the lower diagonal starts beyond the end of the first sequence).

debug_stream.hpp
Provides seqan3::debug_stream and related types.
align_config_gap_cost_affine.hpp
Provides seqan3::align_config::gap_cost_affine.
seqan3::align_cfg::free_end_gaps_sequence1_trailing
A strong type representing free_end_gaps_sequence1_trailing of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:85
dna4.hpp
Provides seqan3::dna4, container aliases and string literals.
align_pairwise.hpp
Provides pairwise alignment function.
utility
seqan3::align_cfg::output_end_position
Configures the alignment result to output the end position.
Definition: align_config_output.hpp:86
seqan3::align_cfg::free_end_gaps_sequence2_leading
A strong type representing free_end_gaps_sequence2_leading of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:75
align_config_method.hpp
Provides global and local alignment configurations.
vector
seqan3::align_cfg::free_end_gaps_sequence1_leading
A strong type representing free_end_gaps_sequence1_leading of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:65
seqan3::nucleotide_scoring_scheme
A data structure for managing and computing the score of two nucleotides.
Definition: nucleotide_scoring_scheme.hpp:38
nucleotide_scoring_scheme.hpp
Provides seqan3::nucleotide_scoring_scheme.
seqan3::align_cfg::edit_scheme
constexpr configuration edit_scheme
Shortcut for edit distance configuration.
Definition: align_config_edit.hpp:51
seqan3::align_pairwise
constexpr auto align_pairwise(sequence_t &&seq, alignment_config_t const &config)
Computes the pairwise alignment for a pair of sequences or a range over sequence pairs.
Definition: align_pairwise.hpp:138
all.hpp
Meta-header for the pairwise submodule .
seqan3::views::persist
constexpr auto persist
A view adaptor that wraps rvalue references of non-views.
Definition: persist.hpp:233
std::tie
T tie(T... args)
aminoacid_scoring_scheme.hpp
Provides seqan3::aminoacid_scoring_scheme.
seqan3::align_cfg::output_alignment
Configures the alignment result to output the alignment.
Definition: align_config_output.hpp:168
pairwise_combine.hpp
Provides seqan3::views::pairwise_combine.
seqan3::mismatch_score
A strong type of underlying type score_type that represents the score two different characters.
Definition: scoring_scheme_base.hpp:66
seqan3::views::pairwise_combine
constexpr auto pairwise_combine
A view adaptor that generates all pairwise combinations of the elements of the underlying range.
Definition: pairwise_combine.hpp:709
seqan3::align_cfg::output_begin_position
Configures the alignment result to output the begin positions.
Definition: align_config_output.hpp:129
align_config_band.hpp
Provides seqan3::detail::align_config_band.
seqan3::align_cfg::gap_cost_affine
A configuration element for the affine gap cost scheme.
Definition: align_config_gap_cost_affine.hpp:74
aa27.hpp
Provides seqan3::aa27, container aliases and string literals.
align_config_output.hpp
Provides configuration for alignment output.
seqan3::align_cfg::band_fixed_size
Configuration element for setting a fixed size band.
Definition: align_config_band.hpp:72
seqan3::align_cfg::free_end_gaps_sequence2_trailing
A strong type representing free_end_gaps_sequence2_trailing of the seqan3::align_cfg::method_global.
Definition: align_config_method.hpp:95
all.hpp
Meta-header for the alignment configuration module .
seqan3::align_cfg::output_score
Configures the alignment result to output the score.
Definition: align_config_output.hpp:43
seqan3::align_cfg::min_score
Sets the minimal score (maximal errors) allowed during an distance computation e.g....
Definition: align_config_min_score.hpp:37
seqan3::align_cfg::scoring_scheme
Sets the scoring scheme for the alignment algorithm.
Definition: align_config_scoring_scheme.hpp:45
ranges
Adaptations of concepts from the Ranges TS.
seqan3::aminoacid_scoring_scheme
A data structure for managing and computing the score of two amino acids.
Definition: aminoacid_scoring_scheme.hpp:75
seqan3::align_cfg::method_global
Sets the global alignment method.
Definition: align_config_method.hpp:106
seqan3::align_cfg::lower_diagonal
A strong type representing the lower diagonal of the seqan3::align_cfg::band_fixed_size.
Definition: align_config_band.hpp:29
all.hpp
Meta-header for the Scoring sub-module .
seqan3::aminoacid_scoring_scheme::set_similarity_matrix
constexpr void set_similarity_matrix(aminoacid_similarity_matrix const matrix_id)
Set the similarity matrix scheme (e.g. BLOSUM62).
Definition: aminoacid_scoring_scheme.hpp:122
seqan3::align_cfg::open_score
A strong type of underlying type int32_t that represents a score (usually negative) that is incurred ...
Definition: align_config_gap_cost_affine.hpp:34
seqan3::debug_stream
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:42
seqan3::match_score
A strong type of underlying type score_type that represents the score of two matching characters.
Definition: scoring_scheme_base.hpp:41
seqan3::align_cfg::extension_score
A strong type of underlying type int32_t that represents the score (usually negative) of any characte...
Definition: align_config_gap_cost_affine.hpp:52
seqan3::align_cfg::upper_diagonal
A strong type representing the upper diagonal of the seqan3::align_cfg::band_fixed_size.
Definition: align_config_band.hpp:40
align_config_edit.hpp
Provides seqan3::align_cfg::edit_scheme.