SeqAn3
The Modern C++ library for sequence analysis.
Pairwise Alignment

Learning Objective:

In this tutorial you will learn how to compute pairwise sequence alignments with SeqAn3. This tutorial is a walkthrough with links to the API documentation and is also meant as a source for copy-and-paste code.

DifficultyIntermediate
Duration60-90 min
Prerequisite tutorialsQuick Setup (using CMake) Alphabets in SeqAn3 Ranges
Recommended reading

Introduction

Aligning biological sequences is a very prominent component in many bioinformatics applications and pipelines. Well known genomic applications that use pairwise sequence alignments are read mapping, genome assembly, variant detection, multiple sequence alignment as well as protein search.

The goal of the pairwise alignment is to obtain an optimal transcript that describes how two DNA sequences are related to each other by means of substitutions, insertions, or deletions. The computed transcript describes then the operations necessary to translate the one sequence into the other, as can be seen in the following picture.

A transcript between two aligned sequences

The alignment problem is solved with a dynamic programming (DP) algorithm which runs in $ (\mathcal{O}(n^2))$ time and space. Besides the global alignment approach many more variations of this DP based algorithm have been developed over time. SeqAn unified all of these approaches into a single DP core implementation which can be extended easily and thus, with all possible configurations, is a very versatile and powerful tool to compute many desired alignment variants.

Computing pairwise alignments

Let us first have look at an example of computing a global alignment in SeqAn3.

1 #include <utility>
2 
7 
8 using namespace seqan3;
9 
10 int main()
11 {
12  dna4_vector s1 = "ACGTGAACTGACT"_dna4;
13  dna4_vector s2 = "ACGAAGACCGAT"_dna4;
14 
15  // Configure the alignment kernel.
16  auto config = align_cfg::mode{global_alignment} |
18 
19  // Invoke the pairwise alignment which returns a lazy range over alignment results.
20  auto results = align_pairwise(std::tie(s1, s2), config);
21  auto & res = *results.begin();
22  debug_stream << "Score: " << res.score() << '\n';
23 }
Provides pairwise alignment function.
T tie(T... args)
constexpr detail::global_alignment_type global_alignment
Helper variable to select the global alignment.
Definition: align_config_mode.hpp:54
constexpr auto align_pairwise(exec_policy_t const &exec, sequence_t &&seq, alignment_config_t const &config)
Computes the pairwise alignment for a pair of sequences or a range over sequence pairs.
Definition: align_pairwise.hpp:139
Provides seqan3::nucleotide_scoring_scheme.
The main SeqAn3 namespace.
Sets the scoring scheme for the alignment algorithm.
Definition: align_config_scoring.hpp:41
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:39
A data structure for managing and computing the score of two nucleotides.
Definition: nucleotide_scoring_scheme.hpp:37
Sets the alignment mode.
Definition: align_config_mode.hpp:94
Provides seqan3::dna4, container aliases and string literals.
Provides seqan3::debug_stream and related types.

In the above example we want to compute the similarity of two seqan3::dna4 sequences using a global alignment. For this we need to import the dna4 alphabet, the seqan3::nucleotide_scoring_scheme and the seqan3::align_pairwise in the beginning of our file. We also import the seqan3::debug_stream which allows us to print various types in a nice formatted manner.

In the beginning of the file we are defining our two sequences DNA sequences first_seq and second_seq. If you feel puzzled about what the _dna4 suffix does we recommend to read the Alphabets in SeqAn3 and Ranges before. In line 16-17 we configure the alignment job with the most simplistic configuration possible. In this case it is a global alignment with edit distance. Later in this tutorial we will give a more detailed description of the configuration and how it can be used. The minimum requirement for computing a pairwise alignment is to specify the seqan3::align_cfg::mode and the seqan3::align_cfg::scoring configuration elements. The first one selects the internal algorithm and the second one provides the scoring scheme that should be used to score a pair of sequence characters.

Now we are going to call seqan3::align_pairwise. This interface requires two arguments: a tuple or a range of tuples with exactly two elements and the configuration object. Independent of the number of pairwise alignment jobs submitted to the algorithm it always returns a range over seqan3::alignment_result. Later we will see how we can use a continuation interface which calls a user-defined function rather than iterating over the results sequentially. Finally, we output the score for the computed alignment.

Attention
Just calling pairwise_align does nothing as it returns a range which is evaluated in a lazy manner. Only when calling begin or incrementing the iterator over the range the alignment computation is invoked.

Assignment 1

Copy and paste the minimal working example from above into a cpp file in your working directory and compile and run it. Add the two sequences "ACGTGACTGACT" and "AGGTACGAGCGACACT" to the set and compute all pairwise sequence alignments of the four sequences and output the scores.

Hint You can use the seqan3::views::pairwise_combine to generate all pairwise combinations.
Solution

#include <utility>
#include <vector>
using namespace seqan3;
int main()
{
std::vector vec{"ACGTGAACTGACT"_dna4,
"ACGAAGACCGAT"_dna4,
"ACGTGACTGACT"_dna4,
"AGGTACGAGCGACACT"_dna4};
// Configure the alignment kernel.
for (auto const & res : align_pairwise(views::pairwise_combine(vec), config))
debug_stream << "Score: " << res.score() << '\n';
}

First we create the vector of seqan3::dna4 sequences. We keep the configuration as is and then modify the initial code to a range-based for loop looping over the alignment results. Since the seqan3::alignment_result is a class template and the template parameters are determined during the configuration step we use auto as the result type. The current result is cached inside of the lazy range and we capture the result as const & in order to not tamper with the result values.

Congratulations, you have computed your first pairwise alignment with seqan3! As you can see, the interface is really simple, yet the configuration object makes it extremely flexible to conduct various different alignment calculations. In the following chapter you will learn more about the various configuration possibilities.

Alignment configurations

The configuration object is the core of the alignment interface. It allows to easily configure the alignment algorithm without changing the interface of the pairwise alignment. It uses a seqan3::configuration object to chain different configuration elements together using the logical or-operator ('|'-operator). You can find an overview over the available configurations here. The configurations for the alignment module are available in:

The configuration elements are all classes that wrap the actual information necessary for the configuration of the alignment algorithm. Depending on the configuration specification certain features of the algorithm are enabled or disabled. Moreover, during the initialisation of the algorithm the best implementation is chosen based on the given configurations. To avoid possible ambiguities with the configurations of other algorithms, the configuration elements for the alignment are defined in the special namespace seqan3::align_cfg.

Global and semi-global alignment

The most straightforward algorithm is the global alignment which can be configured using the seqan3::align_cfg::mode with seqan3::global_alignment as the constructor argument.

Remarks
The mode configuration must be given by the user as it strongly depends on the application context. It would be wrong for us to assume what the intended default behaviour should be.

The global alignment can be further refined by setting the seqan3::align_cfg::aligned_ends option. The seqan3::align_cfg::aligned_ends class specifies wether or not gaps at the end of the sequences are penalised. In SeqAn you can configure this behaviour for every end (front and back of the first sequence and second sequence) separately using the seqan3::end_gaps class. This class is constructed with up to 4 end gap specifiers (one for every end):

These classes can be constructed with either a constant boolean (std::true_type or std::false_type) or a regular bool argument. The former enables static configuration of the respective features in the alignment algorithm. The latter allows to configure these features at runtime. This makes setting these values from runtime dependent parameters, e.g. user input, much easier. The following code snippet demonstrates the different use cases:

using namespace seqan3;
front_end_second fes{true};
back_end_second bes{false};
auto cfg_1 = align_cfg::aligned_ends{end_gaps{fef, bef, fes, bes}};
auto cfg_2 = align_cfg::aligned_ends{end_gaps{fef, fes}};

The cfg_1 and the cfg_2 will result in the exact same configuration of the alignment where aligning the front of either sequence with gaps is not penalised while the back of both sequences is. The order of the arguments is irrelevant. Specifiers initialised with constant booleans can be mixed with those initialised with bool values. If a specifier for a particular sequence end is not given, it defaults to the specifier initialised with std::false_type.

Note
You should always prefer initialising the end-gaps specifiers using the boolean constants if possible as it reduces the compile time. The reason for this is that the runtime information is converted into static types for the alignment algorithm. For every end-gap specifier the compiler will generate two versions for the true and the false case. This adds up to 16 different paths the compiler needs to instantiate.

SeqAn also offers predefined seqan3::end_gaps configurations that cover the typical use cases.

Entity Meaning
free_ends_none Enables the typical global alignment.
free_ends_all Enables overlap alignment, where the end of one sequence can overlap the end of the other sequence.
free_ends_first Enables semi global alignment, where the second sequence is aligned as an infix of the first sequence.
free_ends_second Enables semi global alignment, where the first sequence is aligned as an infix of the second sequence.

Assignment 2

Adapt the code from above and compute the semi-global alignment where both ends of the first sequence can be aligned to gaps without penalising them.

Solution

#include <utility>
#include <vector>
using namespace seqan3;
int main()
{
std::vector vec{"ACGTGAACTGACT"_dna4,
"ACGAAGACCGAT"_dna4,
"ACGTGACTGACT"_dna4,
"AGGTACGAGCGACACT"_dna4};
// Configure the alignment kernel.
align_cfg::aligned_ends{free_ends_first};
for (auto const & res : align_pairwise(views::pairwise_combine(vec), config))
debug_stream << "Score: " << res.score() << '\n';
}

To accomplish our goal we simply add the align_cfg::aligned_ends option initialised with free_ends_first to the existing configuration.

Scoring schemes and gap schemes

A scoring scheme can be queried to get the score for substituting two alphabet values using the score member function. SeqAn currently supports currently two scoring schemes: seqan3::nucleotide_scoring_scheme and seqan3::aminoacid_scoring_scheme. You can import them with the following includes:

As the names suggests, you need to use the former when scoring nucleotides and the latter when working with aminoacids. You have already used the seqan3::nucleotide_scoring_scheme in the assignments before. The scoring scheme was default initialised which will result in using the Hamming Distance. The scoring schemes can also be configured by either setting a simple scheme consisting of seqan3::match_score and seqan3::mismatch_score or by providing a custom matrix. The amino acid scoring scheme can additionally be initialised with a predefined substitution matrix that can be accessed via the seqan3::aminoacid_similarity_matrix enumeration class.

using namespace seqan3;
// Define a simple scoring scheme with match and mismatch cost and get the score.
auto sc_nc = nc_scheme.score('A'_dna4, 'C'_dna4); // sc_nc == -5.
// Define a amino acid similarity matrix and get the score.
auto sc_aa = aa_scheme.score('M'_aa27, 'K'_aa27); // sc_aa == 2.
Note
You can also provide your own scoring scheme implementation if it models seqan3::scoring_scheme.

Similarly to the scoring scheme, you can use the seqan3::gap_scheme to set the gap penalties used for the alignment computation. The default constructed seqan3::gap_scheme sets the score for a gap to -1 and for a gap opening to 0. Note that the gap open score is added to the gap score when a gap is opened within the alignment computation. Therefore setting the gap open score to 0 disables affine gaps. You can pass a seqan3::gap_score and optionally a seqan3::gap_open_score object to initialise the scheme with custom gap penalties. The penalties can be changed later by using the respective member functions seqan3::gap_scheme::set_linear or seqan3::gap_scheme::set_affine.

Attention
SeqAn's alignment algorithm computes the maximal similarity score, thus the match score must be set to a positive value and the score for mismatch and gaps must be negative in order to maximise over the matching letters.
using namespace seqan3;
// Define a gap scheme with custom gap scores.
auto gap = g.get_gap_score(); // gap == -1
auto gap_open = g.get_gap_open_score(); // gap_open == -10

To configure the scoring scheme and the gap scheme for the alignment algorithm you need to use the seqan3::align_cfg::scoring and the seqan3::align_cfg::gap configurations. The seqan3::align_cfg::scoring is mandatory - similarly to the seqan3::align_cfg::mode configuration. It would be wrong to assume what the default scoring scheme should be. If you do not provide these configurations, the compilation will fail with a corresponding error message. Not providing the gap scheme is ok. In this case the default initialised gap scheme will be used for the alignment computation.

Assignment 3

Compute the alignment of the two amino acid sequences listed below using an affine gap scheme with gap costs of -2 and gap open costs of -9. Use the BLOSUM62 similarity matrix.

  • QFSEEILSDIYCWMLQCGQERAV
  • AFLPGWQENKTKIWMKDCGCLWHKG
Solution
#include <utility>
using namespace seqan3;
int main()
{
auto seq1 = "QFSEEILSDIYCWMLQCGQERAV"_aa27;
auto seq2 = "AFLPGWQEENKLSKIWMKDCGCLW"_aa27;
// Configure the alignment kernel.
for (auto const & res : align_pairwise(std::tie(seq1, seq2), config))
debug_stream << "Score: " << res.score() << '\n';
}

Only a few parts of our algorithm need to be adapted. First, we use an amino acid scoring scheme and initialise it with the respective similarity matrix. Second, we initialise the gap scheme to represent the affine gap model as given in the assignment. Et voilĂ , we have computed a pairwise alignment over aminoacid sequences.

Alignment result

So far we have only computed the score, but obviously in many situations the final alignment is required, e.g. when mapping reads and the user wishes to write the alignment to the final SAM/BAM file. In SeqAn you can simply configure what is going to be computed by the alignment algorithm using the seqan3::align_cfg::result configuration.

using namespace seqan3;
// Configure the alignment to only compute the score.
auto cfg = align_cfg::result{with_score};

Accordingly, the alignment algorithm is configured to use the best implementation to obtain the desired result. The following table shows the different outcomes that can be configured:

Entity Available result
seqan3::with_score alignment score
seqan3::with_back_coordinate alignment score; back coordinate
seqan3::with_front_coordinate alignment score; back and front coordinate
seqan3::with_alignment alignment score; back and front coordinate; alignment

The final result is returned as a seqan3::alignment_result object. This object offers special member functions to access the stored values. If you try to access a value, e.g. the alignment, although you didn't specify with_alignment in the result configuration, a static assertion will be triggered during compilation.

Assignment 4

Compute the overlap alignment of the following two sequences. Use a linear gap scheme with a gap score of -4 and a simple scoring scheme with mismatch -2 and match 4.

  • TTACGTACGGACTAGCTACAACATTACGGACTAC
  • GGACGACATGACGTACGACTTTACGTACGACTAGC
Solution
#include <utility>
using namespace seqan3;
int main()
{
auto seq1 = "TTACGTACGGACTAGCTACAACATTACGGACTAC"_dna4;
auto seq2 = "GGACGACATGACGTACGACTTTACGTACGACTAGC"_dna4;
// Configure the alignment kernel.
align_cfg::aligned_ends{free_ends_all} |
align_cfg::result{with_alignment};
for (auto const & res : align_pairwise(std::tie(seq1, seq2), config))
{
debug_stream << "Score: " << res.score() << '\n';
debug_stream << "Begin: " << res.front_coordinate() << '\n';
debug_stream << "End: " << res.back_coordinate() << '\n';
debug_stream << "Alignment: \n" << res.alignment() << '\n';
}
}

Banded alignment

In many situations it is not necessary to compute the entire alignment matrix but only a part of it. This has positive impacts on the performance. To limit the computation space the alignment matrix can be bounded by a band. Thus, only the alignment is computed that fits in this band. Note that this must not be the optimal alignment but in many cases we can give a rough bound on how similar the sequences will be and therefor use the banded alignment. To do so, you can use a seqan3::static_band. It will be initialised with a seqan3::lower_bound and a seqan3::upper_bound. To configure the banded alignment you need to use the seqan3::align_cfg::band configuration.

using namespace seqan3;
// Configure a banded alignment.

Assignment 5

Use the example from assignment 4 and compute it in a band with lower bound set to -3 and upper bound set to 8. How does the result change?

Solution

#include <utility>
using namespace seqan3;
int main()
{
auto seq1 = "TTACGTACGGACTAGCTACAACATTACGGACTAC"_dna4;
auto seq2 = "GGACGACATGACGTACGACTTTACGTACGACTAGC"_dna4;
// Configure the alignment kernel.
align_cfg::aligned_ends{free_ends_all} |
align_cfg::result{with_alignment} |
for (auto const & res : align_pairwise(std::tie(seq1, seq2), config))
{
debug_stream << "Score: " << res.score() << '\n';
debug_stream << "Begin: " << res.front_coordinate() << '\n';
debug_stream << "End: " << res.back_coordinate() << '\n';
debug_stream << "Alignment: \n" << res.alignment() << '\n';
}
}

Edit distance

A special form of the pairwise sequence alignment is the edit distance. This distance metric counts the number of edits necessary to transform one sequence into the other. The cost model for the edit distance is fixed. In particular, the match score is 0 and the scores for a mismatch and a gap is -1. Due to the special metric a fast bitvector implementation can be used to compute the edit distance. This happens in SeqAn automatically if the respective configurations are used. To do so, you need a scoring scheme initialised with Manhattan distance (at the moment only seqan3::nucleotide_scoring_scheme supports this) and a gap scheme initialised with -1 for a gap and 0 for a gap open score and computing a seqan3::global_alignment. To make the configuration easier, we added a shortcut called seqan3::align_cfg::edit.

using namespace seqan3;
// Configure an edit distance alignment.
auto cfg = align_cfg::edit;

Refine edit distance

The edit distance can be further refined using seqan3::align_cfg::aligned_ends to also compute a semi-global alignment and the seqan3::align_cfg::max_error configuration to give an upper limit of the allowed number of edits. If the respective alignment could not find a solution within the given error bound, the resulting score is infinity (corresponds to std::numeric_limits::max). Also the alignment and the front and back coordinates can be computed using the align_cfg::result option.

Attention
Only the options seqan3::free_ends_none and seqan3::free_ends_first are supported for the aligned ends configuration with the edit distance. Using any other aligned ends configuration will disable the edit distance and fall back to the standard pairwise alignment and will not use the fast bitvector algorithm.

Assignment 5

Compute all pairwise alignments from the assignment 1 (only the scores). Only allow at most 7 errors and filter all alignments with 6 or less errors.

Hint You can use the std::views::filter to get only those alignments that fit the requirements.

Solution

#include <utility>
#include <vector>
using namespace seqan3;
int main()
{
std::vector vec{"ACGTGACTGACT"_dna4,
"ACGAAGACCGAT"_dna4,
"ACGTGACTGACT"_dna4,
"AGGTACGAGCGACACT"_dna4};
// Configure the alignment kernel.
auto config = align_cfg::edit |
align_cfg::result{with_score};
auto filter_v = std::views::filter([](auto && res) { return res.score() >= -6;});
for (auto const & res : align_pairwise(views::pairwise_combine(vec), config) | views::persist | filter_v)
{
debug_stream << "Score: " << res.score() << '\n';
}
}

Invalid configurations

Chaining the configurations to build an individual alignment algorithm is a strong advantage of this design. However, some combinations would result in an invalid alignment configuration. To explicitly prevent this we added some security details. First, if a combination is invalid (for example by providing the same configuration more than once) a static assert will inform you about the invalid combination. Here you can find a table depicting the valid configurations. Further, if the seqan3::align_pairwise is called, it checks if the input data can be used with the given configuration. For example, a static assertion is emitted if the alphabet types of the sequences together with the provided scoring scheme do not model seqan3::scoring_scheme. Other possible errors are invalid band settings where the initialised band does not intersect with the actual alignment matrix (the lower diagonal starts beyond the end of the first sequence).