SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
Loading...
Searching...
No Matches
Pairwise Alignment

Learning Objective:

In this tutorial, you will learn how to compute pairwise sequence alignments with SeqAn. This tutorial is a walkthrough with links to the API documentation and is also meant as a source for copy-and-paste code.

DifficultyIntermediate
Duration60-90 min
Prerequisite tutorialsQuick Setup (using CMake) Alphabets in SeqAn Ranges
Recommended reading

Introduction

Aligning biological sequences is a very prominent component in many bioinformatics applications and pipelines. Well known genomic applications that use pairwise sequence alignments are read mapping, genome assembly, variant detection, multiple sequence alignment as well as protein search.

The goal of the pairwise alignment is to obtain an optimal transcript that describes how two DNA sequences are related to each other by means of substitutions, insertions, or deletions. The computed transcript describes then the operations necessary to translate the one sequence into the other, as can be seen in the following picture.

The alignment problem is solved with a dynamic programming (DP) algorithm which runs in \( (\mathcal{O}(n^2))\) time and space. Besides the global alignment approach, many more variations of this DP based algorithm have been developed over time. SeqAn unified all of these approaches into a single DP core implementation which can be extended easily and thus, with all possible configurations, is a very versatile and powerful tool to compute many desired alignment variants.

Computing pairwise alignments

Let us first have a look at an example of computing a global alignment in SeqAn.

1// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
2// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
3// SPDX-License-Identifier: CC0-1.0
4
5#include <utility>
6
11
12int main()
13{
14 using namespace seqan3::literals;
15
16 seqan3::dna4_vector s1 = "ACGTGAACTGACT"_dna4;
17 seqan3::dna4_vector s2 = "ACGAAGACCGAT"_dna4;
18
19 // Configure the alignment kernel.
20 auto config =
22
23 // Invoke the pairwise alignment which returns a lazy range over alignment results.
24 auto results = seqan3::align_pairwise(std::tie(s1, s2), config);
25 auto & res = *results.begin();
26 seqan3::debug_stream << "Score: " << res.score() << '\n';
27}
Provides pairwise alignment function.
Sets the global alignment method.
Definition align_config_method.hpp:119
Sets the scoring scheme for the alignment algorithm.
Definition align_config_scoring_scheme.hpp:42
A data structure for managing and computing the score of two nucleotides.
Definition nucleotide_scoring_scheme.hpp:35
Provides seqan3::debug_stream and related types.
Provides seqan3::dna4, container aliases and string literals.
constexpr auto align_pairwise(sequence_t &&seq, alignment_config_t const &config)
Computes the pairwise alignment for a pair of sequences or a range over sequence pairs.
Definition align_pairwise.hpp:131
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition debug_stream.hpp:37
The SeqAn namespace for literals.
Provides seqan3::nucleotide_scoring_scheme.
T tie(T... args)

In the above example, we want to compute the similarity of two seqan3::dna4 sequences using a global alignment. For this, we need to import the dna4 alphabet, the seqan3::nucleotide_scoring_scheme and the seqan3::align_pairwise in the beginning of our file. We also import the seqan3::debug_stream which allows us to print various types in a nice formatted manner.

At the beginning of the file, we are defining our two DNA sequences s1 and s2. If you feel puzzled about what the _dna4 suffix does, we recommend reading the Alphabets in SeqAn and Ranges before. In line 16-17 we configure the alignment job with the most simplistic configuration possible. In this case, it is a global alignment with edit distance. Later in this tutorial we will give a more detailed description of the configuration and how it can be used. The minimum requirement for computing a pairwise alignment is to specify the alignment method (seqan3::align_cfg::method_local or seqan3::align_cfg::method_global) and the seqan3::align_cfg::scoring_scheme configuration elements. The first one selects the internal algorithm and the second one provides the scoring scheme that should be used to score a pair of sequence characters.

Now we are going to call seqan3::align_pairwise. This interface requires two arguments: a tuple or a range of tuples with exactly two elements and the configuration object. Independent of the number of pairwise alignment jobs submitted to the algorithm it always returns a range over seqan3::alignment_result. Later we will see how we can use a continuation interface that calls a user-defined function rather than iterating over the results sequentially. Finally, we output the score for the computed alignment.

Attention
Just calling pairwise_align does nothing as it returns a range that is evaluated in a lazy manner. Only when calling begin or incrementing the iterator over the range the alignment computation is invoked.

Assignment 1

Copy and paste the minimal working example from above into a cpp file in your working directory and compile and run it. Add the two sequences "ACGTGACTGACT" and "AGGTACGAGCGACACT" to the set and compute all pairwise sequence alignments of the four sequences and output the scores.

Hint You can use the seqan3::views::pairwise_combine to generate all pairwise combinations.
Solution

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <utility>
#include <vector>
int main()
{
using namespace seqan3::literals;
std::vector vec{"ACGTGAACTGACT"_dna4, "ACGAAGACCGAT"_dna4, "ACGTGACTGACT"_dna4, "AGGTACGAGCGACACT"_dna4};
// Configure the alignment kernel.
auto config =
for (auto const & res : seqan3::align_pairwise(seqan3::views::pairwise_combine(vec), config))
seqan3::debug_stream << "Score: " << res.score() << '\n';
}
The main SeqAn3 namespace.
Definition aligned_sequence_concept.hpp:26
Provides seqan3::views::pairwise_combine.

First, we create the vector of seqan3::dna4 sequences. We keep the configuration as is and then modify the initial code to a range-based for loop looping over the alignment results. Since the seqan3::alignment_result is a class template and the template parameters are determined during the configuration step we use auto as the result type. The current result is cached inside of the lazy range and we capture the result as const & in order to not tamper with the result values.

Congratulations, you have computed your first pairwise alignment with SeqAn! As you can see, the interface is really simple, yet the configuration object makes it extremely flexible to conduct various different alignment calculations. In the following chapter, you will learn more about the various configuration possibilities.

Alignment configurations

The configuration object is the core of the alignment interface. It allows to easily configure the alignment algorithm without changing the interface of the pairwise alignment. It uses a seqan3::configuration object to chain different configuration elements together using the logical or-operator ('|'-operator). You can find an overview over the available configurations here. The configurations for the alignment module are available in:

Meta-header for the Alignment / Configuration submodule .

The configuration elements are all classes that wrap the actual information necessary for the configuration of the alignment algorithm. Depending on the configuration specification certain features of the algorithm are enabled or disabled. Moreover, during the initialisation of the algorithm, the best implementation is chosen based on the given configurations. To avoid possible ambiguities with the configurations of other algorithms, the configuration elements for the alignment are defined in the special namespace seqan3::align_cfg.

Global and semi-global alignment

The most straightforward algorithm is the global alignment which can be configured using the seqan3::align_cfg::method_global.

Note
The method configuration must be given by the user as it strongly depends on the application context. It would be wrong for us to assume what the intended default behaviour should be.

The global alignment can be further refined by initialising the seqan3::align_cfg::method_global configuration element with the free end-gap specifiers. They specify whether gaps at the end of the sequences are penalised. In SeqAn you can configure this behaviour for every end, namely for leading and trailing gaps of the first and second sequence. seqan3::align_cfg::method_global is constructed with 4 free end-gap specifiers (one for every end):

The following code snippet demonstrates the different use cases:

Provides global and local alignment configurations.
// Example of a semi-global alignment where leading and trailing gaps in the
// second sequence are not penalised:
A strong type representing free_end_gaps_sequence1_leading of the seqan3::align_cfg::method_global.
Definition align_config_method.hpp:65
A strong type representing free_end_gaps_sequence1_trailing of the seqan3::align_cfg::method_global.
Definition align_config_method.hpp:85
A strong type representing free_end_gaps_sequence2_leading of the seqan3::align_cfg::method_global.
Definition align_config_method.hpp:75
A strong type representing free_end_gaps_sequence2_trailing of the seqan3::align_cfg::method_global.
Definition align_config_method.hpp:95

The order of arguments is fixed and must always be as shown in the example.

Assignment 2

Adapt the code from Assignment 1 to compute the semi-global alignment for the case where both ends of the first sequence can be aligned to gaps without penalising them. Note that in such a semi-global alignment, the first sequence would be aligned as an infix of the second sequence.

Solution

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <utility>
#include <vector>
int main()
{
using namespace seqan3::literals;
std::vector vec{"ACGTGAACTGACT"_dna4, "ACGAAGACCGAT"_dna4, "ACGTGACTGACT"_dna4, "AGGTACGAGCGACACT"_dna4};
// Configure the alignment kernel.
for (auto const & res : seqan3::align_pairwise(seqan3::views::pairwise_combine(vec), config))
seqan3::debug_stream << "Score: " << res.score() << '\n';
}

To accomplish our goal we initialise the method_global option with the free end specifiers for sequence 1 set to true, and those for sequence 2 with false.

Scoring schemes and gap schemes

A scoring scheme can be queried to get the score for substituting two alphabet values using the score member function. Currently, SeqAn supports two scoring schemes: seqan3::nucleotide_scoring_scheme and seqan3::aminoacid_scoring_scheme. You can import them with the following includes:

As the names suggests, you need to use the former when scoring nucleotides and the latter when working with aminoacids. You have already used the seqan3::nucleotide_scoring_scheme in the assignments before. The scoring scheme was default initialised using the Hamming Distance. The scoring schemes can also be configured by either setting a simple scheme consisting of seqan3::match_score and seqan3::mismatch_score or by providing a custom matrix. The amino acid scoring scheme can additionally be initialised with a predefined substitution matrix that can be accessed via the seqan3::aminoacid_similarity_matrix enumeration class.

using namespace seqan3::literals;
// Define a simple scoring scheme with match and mismatch cost and get the score.
auto sc_nc = nc_scheme.score('A'_dna4, 'C'_dna4); // sc_nc == -5.
// Define a amino acid similarity matrix and get the score.
auto sc_aa = aa_scheme.score('M'_aa27, 'K'_aa27); // sc_aa == 2.
A data structure for managing and computing the score of two amino acids.
Definition aminoacid_scoring_scheme.hpp:72
constexpr void set_similarity_matrix(aminoacid_similarity_matrix const matrix_id)
Set the similarity matrix scheme (e.g. blosum62).
Definition aminoacid_scoring_scheme.hpp:118
@ blosum30
The blosum30 matrix for very distantly related proteins.
A strong type of underlying type score_type that represents the score of two matching characters.
Definition scoring_scheme_base.hpp:38
A strong type of underlying type score_type that represents the score two different characters.
Definition scoring_scheme_base.hpp:63
Note
You can also provide your own scoring scheme implementation if it models seqan3::scoring_scheme.

Similarly to the scoring scheme, you can use the seqan3::align_cfg::gap_cost_affine to set the gap penalties used for the alignment computation. The default initialised seqan3::align_cfg::gap_cost_affine sets the score for a gap to -1 and for a gap opening to 0. Note that the gap open score is added to the gap score when a gap is opened within the alignment computation. Therefore setting the gap open score to 0 disables affine gaps. You can pass a seqan3::align_cfg::extension_score and a seqan3::align_cfg::open_score object to initialise the scheme with custom gap penalties. The penalties can be changed later by using the respective member variables extension_score and open_score.

Attention
SeqAn's alignment algorithm computes the maximal similarity score, thus the match score must be set to a positive value and the score for mismatch and gaps must be negative in order to maximise over the matching letters.
// Define a gap scheme with custom gap scores.
int open_score = affine_scheme.open_score; // == -10
int extension_score = affine_scheme.extension_score; // == -1
A configuration element for the affine gap cost scheme.
Definition align_config_gap_cost_affine.hpp:72
A strong type of underlying type int32_t that represents the score (usually negative) of any characte...
Definition align_config_gap_cost_affine.hpp:48
A strong type of underlying type int32_t that represents a score (usually negative) that is incurred ...
Definition align_config_gap_cost_affine.hpp:31

To configure the scoring scheme and the gap scheme for the alignment algorithm you need to use the seqan3::align_cfg::scoring_scheme and the seqan3::align_cfg::gap_cost_affine configurations. The seqan3::align_cfg::scoring_scheme is mandatory - similarly to the alignment method configuration. It would be wrong to assume what the default scoring scheme should be. If you do not provide these configurations, the compilation will fail with a corresponding error message. Not providing the gap scheme is ok. In this case, the default initialised gap scheme will be used for the alignment computation.

Assignment 3

Compute the alignment of the two amino acid sequences listed below using an affine gap scheme with gap costs of -2 and gap open costs of -9. Use the BLOSUM62 similarity matrix.

  • QFSEEILSDIYCWMLQCGQERAV
  • AFLPGWQEENKLSKIWMKDCGCLW
Solution
// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <seqan3/alignment/pairwise/all.hpp> // for seqan3::align_cfg and seqan3::align_pairwise
#include <seqan3/alignment/scoring/all.hpp> // for seqan3::aminoacid_scoring_scheme and
// seqan3::aminoacid_similarity_matrix
#include <seqan3/alphabet/aminoacid/aa27.hpp> // for seqan3::operator""_aa27
int main()
{
using namespace seqan3::literals;
auto seq1 = "QFSEEILSDIYCWMLQCGQERAV"_aa27;
auto seq2 = "AFLPGWQEENKLSKIWMKDCGCLW"_aa27;
// Configure the alignment kernel.
auto config =
for (auto const & res : seqan3::align_pairwise(std::tie(seq1, seq2), config))
seqan3::debug_stream << "Score: " << res.score() << '\n';
}
Provides seqan3::aa27, container aliases and string literals.
Meta-header for the Alignment / Pairwise submodule .
Meta-header for the Alignment / Scoring submodule .
@ blosum62
The blosum62 matrix recommended for most use-cases.
SeqAn specific customisations in the standard namespace.

Only a few parts of our algorithm need to be adapted. First, we use an amino acid scoring scheme and initialise it with the respective similarity matrix. Second, we initialise the gap scheme to represent the affine gap model as given in the assignment. Et voilà, we have computed a pairwise alignment over aminoacid sequences.

Alignment result

So far, we have only used the score. However, in many situations the final alignment is required, e.g. when mapping reads and the user wishes to write the alignment to the final SAM/BAM file. In SeqAn you can simply configure what is going to be computed by the alignment algorithm using the different output configurations.

Provides configuration for alignment output.
// Configure the alignment to only compute the score.
Configures the alignment result to output the score.
Definition align_config_output.hpp:40

Accordingly, the alignment algorithm is configured to use the best implementation to obtain the desired result. The following table shows the different outcomes that can be configured:

Output option Available result
seqan3::align_cfg::output_score alignment score
seqan3::align_cfg::output_end_position end positions of the aligned sequences
seqan3::align_cfg::output_begin_position begin positions of the aligned sequences
seqan3::align_cfg::output_alignment alignment of the two sequences
seqan3::align_cfg::output_sequence1_id id of the first sequence
seqan3::align_cfg::output_sequence2_id id of the second sequence

The final result is returned as a seqan3::alignment_result object. This object offers special member functions to access the stored values. If you try to access a value, e.g. the alignment, although you didn't specify seqan3::align_cfg::output_alignment in the output configuration, a static assertion will be triggered during compilation.

Note
If you don't specify any of the above mentioned output configurations then by default all options are enabled and will be computed. In order to potentially increase the performance of the alignment algorithm only enable those options that are needed for your use case.

Assignment 4

Compute the overlap alignment of the following two sequences. Use a linear gap scheme with a gap score of -4 and a simple scoring scheme with mismatch -2 and match 4.

  • TTACGTACGGACTAGCTACAACATTACGGACTAC
  • GGACGACATGACGTACGACTTTACGTACGACTAGC

Hint An alignment is called overlap alignment if it allows free end-gaps at each sequence end. With this configuration overlaps between two sequences can be computed which is a common use case during the overlap layout consensus assembly.
Solution

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <utility>
int main()
{
using namespace seqan3::literals;
auto seq1 = "TTACGTACGGACTAGCTACAACATTACGGACTAC"_dna4;
auto seq2 = "GGACGACATGACGTACGACTTTACGTACGACTAGC"_dna4;
// Configure the output:
// Configure the alignment kernel together with the previous output configuration.
auto config =
| output_config;
for (auto const & res : seqan3::align_pairwise(std::tie(seq1, seq2), config))
{
seqan3::debug_stream << "Score: " << res.score() << '\n';
seqan3::debug_stream << "Begin: (" << res.sequence1_begin_position() << "," << res.sequence2_begin_position()
<< ")\n";
seqan3::debug_stream << "End: (" << res.sequence1_end_position() << "," << res.sequence2_end_position()
<< ")\n";
seqan3::debug_stream << "Alignment: \n" << res.alignment() << '\n';
}
}
Configures the alignment result to output the alignment.
Definition align_config_output.hpp:168
Configures the alignment result to output the begin positions.
Definition align_config_output.hpp:128
Configures the alignment result to output the end position.
Definition align_config_output.hpp:84

Banded alignment

In many situations it is not necessary to compute the entire alignment matrix but only a part of it. This has positive impacts on the performance. To limit the computation space the alignment matrix can be bounded by a band. Thus, only the alignment is computed that fits in this band. Notably, this alignment does not need to be the optimal alignment. However, in many cases, we can give a rough bound on how similar the sequences will be and therefor use the banded alignment. To do so, you can configure the alignment using the seqan3::align_cfg::band_fixed_size option. This configuration element will be initialised with a seqan3::align_cfg::lower_diagonal and seqan3::align_cfg::upper_diagonal parameter.

Provides seqan3::detail::align_config_band.
// Configure a banded alignment.
Configuration element for setting a fixed size band.
Definition align_config_band.hpp:60
A strong type representing the lower diagonal of the seqan3::align_cfg::band_fixed_size.
Definition align_config_band.hpp:28
A strong type representing the upper diagonal of the seqan3::align_cfg::band_fixed_size.
Definition align_config_band.hpp:39

Assignment 5

Use the example from assignment 4 and compute it in a band with lower diagonal set to -3 and upper diagonal set to 8. How does the result change?

Solution

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <utility>
int main()
{
using namespace seqan3::literals;
auto seq1 = "TTACGTACGGACTAGCTACAACATTACGGACTAC"_dna4;
auto seq2 = "GGACGACATGACGTACGACTTTACGTACGACTAGC"_dna4;
// Configure the output:
// Configure the alignment kernel.
auto config =
| output_config
for (auto const & res : seqan3::align_pairwise(std::tie(seq1, seq2), config))
{
seqan3::debug_stream << "Score: " << res.score() << '\n';
seqan3::debug_stream << "Begin: (" << res.sequence1_begin_position() << "," << res.sequence2_begin_position()
<< ")\n";
seqan3::debug_stream << "End: (" << res.sequence1_end_position() << "," << res.sequence2_end_position()
<< ")\n";
seqan3::debug_stream << "Alignment: \n" << res.alignment() << '\n';
}
}

Edit distance

A special form of the pairwise sequence alignment is the edit distance. This distance metric counts the number of edits necessary to transform one sequence into the other. The cost model for the edit distance is fixed. In particular, the match score is 0 and the scores for a mismatch and a gap is -1. Due to the special metric, a fast bitvector implementation can be used to compute the edit distance. This happens in SeqAn automatically if the respective configurations are used. To do so, you need a scoring scheme initialised with Manhattan distance (at the moment only seqan3::nucleotide_scoring_scheme supports this) and a gap scheme initialised with -1 for a gap and 0 for a gap open score and computing a seqan3::global_alignment. To make the configuration easier, we added a shortcut called seqan3::align_cfg::edit_scheme.

Provides seqan3::align_cfg::edit_scheme.
// Configure an edit distance alignment.
constexpr configuration edit_scheme
Shortcut for edit distance configuration.
Definition align_config_edit.hpp:48

The edit_scheme still has to be combined with an alignment method. When combining it with the seqan3::align_cfg::method_global configuration element, the edit distance algorithm can be further refined with free end-gaps (see section Global and semi-global alignment).

Attention
Only the following free end-gap configurations are supported for the global alignment configuration with the edit scheme:
  • no free end-gaps (all free end-gap specifiers are set to false)
  • free end-gaps for the first sequence (free end-gaps are set to true for the first and to false for the second sequence) Using any other free end-gap configuration will disable the edit distance algorithm, i.e. the fast bitvector algorithm, and will fall back to the standard pairwise alignment.

Refine edit distance

The edit distance can be further refined using the seqan3::align_cfg::min_score configuration to fix an edit score (a limit of the allowed number of edits). If the respective alignment could not find a solution within the given error bound, the resulting score is infinity (corresponds to std::numeric_limits::max). Also the alignment and the begin and end positions of the alignment can be computed using a combination of the seqan3::align_cfg::output_alignment, seqan3::align_cfg::output_begin_position and seqan3::align_cfg::output_end_position options.

Assignment 6

Compute all pairwise alignments from the assignment 1 (only the scores). Only allow at most 7 errors and filter all alignments with 6 or fewer errors.

Hint You can use the std::views::filter to get only those alignments that fit the requirements.

Solution

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
#include <ranges>
#include <utility>
#include <vector>
int main()
{
using namespace seqan3::literals;
std::vector vec{"ACGTGACTGACT"_dna4, "ACGAAGACCGAT"_dna4, "ACGTGACTGACT"_dna4, "AGGTACGAGCGACACT"_dna4};
// Configure the alignment kernel.
auto alignment_results = seqan3::align_pairwise(seqan3::views::pairwise_combine(vec), config);
auto filter_v = std::views::filter(
[](auto && res)
{
return res.score() >= -6;
});
for (auto const & result : alignment_results | filter_v)
{
seqan3::debug_stream << "Score: " << result.score() << '\n';
}
}
Sets the minimal score (maximal errors) allowed during an distance computation e.g....
Definition align_config_min_score.hpp:36

Invalid configurations

Chaining the configurations to build an individual alignment algorithm is a strong advantage of this design. However, some combinations would result in an invalid alignment configuration. To explicitly prevent this we added some security details. First, if a combination is invalid (for example by providing the same configuration more than once) a static assert will inform you about the invalid combination. Here you can find a table depicting the valid configurations. Further, if the seqan3::align_pairwise is called, it checks if the input data can be used with the given configuration. For example, a static assertion is emitted if the alphabet types of the sequences together with the provided scoring scheme do not model the concept seqan3::scoring_scheme_for. Other possible errors are invalid band settings where the initialised band does not intersect with the actual alignment matrix (the lower diagonal starts beyond the end of the first sequence).

Hide me