SeqAn3  3.0.1
The Modern C++ library for sequence analysis.
Alignment

The alignment module contains concepts, algorithms and classes that are related to the computation of pairwise and multiple sequence alignments. More...

+ Collaboration diagram for Alignment:

Modules

 Aligned Sequence
 Provides seqan3::aligned_sequence, as well as various ranges that model it.
 
 Band
 Data structures for computing banded sequence alignments.
 
 Configuration
 Provides configuration elements for the pairwise alignment configuration.
 
 matrix
 Provides data structures for representing alignment coordinates and alignments as a matrix.
 
 Pairwise
 Provides the algorithmic components for the computation of pairwise alignments.
 
 Scoring
 Provides the data structures used for scoring alphabets and sequences.
 

Detailed Description

The alignment module contains concepts, algorithms and classes that are related to the computation of pairwise and multiple sequence alignments.

Introduction

An essential step in almost every bioinformatics application or pipeline is to determine the evolutionary distances of two or more biological sequences (genomic or protein sequences). To get this information on base level resolution one needs to align these sequences. During this alignment step a score is computed which estimates how similar the sequences in question are. Moreover, an alignment transcript can be computed which describes the insertions, deletions and substitutions of bases necessary to transform one sequence into another.

There have been numerous adaptions and modifications of the original global alignment problem to solve similar problems such as the local alignment. Here, the goal is to find a maximal homologue region between two sequences that has been conserved during the evolution. Other examples are the semi-global alignment which is frequently used in read mapping in order to align a smaller sequence into the context of a larger reference sequence.

SeqAn offers a generic multi-purpose alignment library comprising all widely known alignment algorithms as well as many special algorithms. These algorithms are all accessible through an easy to use alignment interface which is described below.

Pairwise alignment

Pairwise sequence alignments can be computed with the free function seqan3::align_pairwise. This function is called in the default case with a sequence pair and an alignment configuration object. Note the type of the pair must model seqan3::tuple_like, e.g. std::tuple or std::pair, with exactly two elements. The algorithm accesses the sequences via the corresponding get interface. Furthermore, the sequences stored in the pair must model std::ranges::viewable_range. This means the type returned by the get interface models either std::ranges::view or std::ranges::range and is an lvalue reference. If you don't know yet what a view or a range is it is recommended to read through the ranges tutorial. The following code snippet demonstrates a simple use of the pairwise alignment interface.

#include <utility>
using seqan3::operator""_dna4;
int main()
{
seqan3::dna4_vector s1 = "ACGTGAACTGACT"_dna4;
seqan3::dna4_vector s2 = "ACGAAGACCGAT"_dna4;
// Configure the alignment kernel.
// Invoke the pairwise alignment which returns a lazy range over alignment results.
auto results = seqan3::align_pairwise(std::tie(s1, s2), config);
auto & res = *results.begin();
seqan3::debug_stream << "Score: " << res.score() << '\n';
}

In this snippet a global alignment over two nucleotide sequences is computed. Here the helper function std::tie is used to pass the two sequences as a tuple to the alignment algorithm. The special interface of std::tie allows to forward the two sequences as lvalue references such that no copy of the data is involved.

There are a lot of applications that need to compute many pairwise sequence alignments. Accordingly, the seqan3::align_pairwise interface offers an overload for ranges over sequence pairs. The following snippet shows a simple use case.

#include <vector>
int main()
{
using seqan3::operator""_dna4;
std::vector data1{"AGTGCTACG"_dna4, "AGTAGACTACG"_dna4, "AGTTACGAC"_dna4};
std::vector data2{"ACGTGCGACTAG"_dna4, "ACGTACGACACG"_dna4, "AGTAGCGATCG"_dna4};
// Configure the alignment kernel.
// Compute the alignment over a range of pairs.
for (auto const & res : seqan3::align_pairwise(seqan3::views::zip(data1, data2), config))
seqan3::debug_stream << "The score: " << res.score() << "\n";
}

In addition to the type requirements above the alignment interface requires std::ranges::random_access_range and std::ranges::sized_range in order to work correctly.

Configuring pairwise alignments

In SeqAn the alignment algorithm can be configured in many different ways. The core of this configuration are the different configuration elements that select specific features of the algorithm. To allow a maximal flexibility the configuration is separated from the alignment interface. This means that before the alignment algorithm is invoked, the algorithm must be configured. The respective alignment configurations are defined in their own namespace called seqan3::align_cfg. This namespace is used to disambiguate configurations for the alignment algorithm with configurations from other algorithms in SeqAn. To compute a pairwise alignment at least two configuration elements must be provided, namely the the seqan3::align_cfg::mode and the seqan3::align_cfg::scoring.

Combining configuration elements

Configurations can be combined using the |-operator. If a combination is invalid, a static assertion is triggered during compilation and will inform the user that the the last config cannot be combined with any of the configs from the left-hand side of the configuration specification. Unfortunately, the names of the invalid types cannot be printed within the static assert, but the following table shows which combinations are possible. In general, the same configuration element cannot occur more than once inside of a configuration specification.

Config 0 1 2 3 4 5 6 7 8 9
0: Aligned ends
1: Band
2: Gap scheme
3: Global alignment
4: Local alignment
5: Max error
6: Parallel
7: Result
8: Scoring scheme
9: Vectorise

Accessing the computed alignment

The seqan3::align_pairwise interface returns a seqan3::alignment_range. This range is a single pass range over the computed alignments and the range's element types are seqan3::alignment_result objects. Even if only a single alignment is computed a range will be returned since it could be possible that one alignment invocation produces multiple results, e.g. to receive suboptimal alignments. The seqan3::alignment_result object contains only the information that has been requested via the alignment configuration. The seqan3::align_cfg::result configuration element allows to limit the parts that are computed in the alignment algorithm. The following table shows the parts that are computed depending on the seqan3::align_cfg::result configuration:

Entity Available result
with_score alignment score
with_back_coordinate alignment score; back coordinate
with_front_coordinate alignment score; back and front coordinate
with_alignment alignment score; back and front coordinate; alignment

The back coordinate stores the end of the alignment within both sequences. Note that theses positions are inclusive. The front coordinate stores the begin of the alignment part in both sequences. With the last option the complete alignment is computed. If seqan3::align_cfg::result is not specified only the score is available. Accessing one of the results that have not been requested will trigger a static assertion informing the developer about the invalid access.

Using scoring and gap schemes

To compute an alignment a scoring and a gap scheme must be provided which give a "score" for substituting, inserting, or deleting a base within the alignment computation. Throughout SeqAn a positive score implies higher similarity and/or a closer relatedness and a lower or even negative score implies distance. If you are used to dealing with "penalties" or "distances", instead think of "negative scores" when using SeqAn interfaces.

Scoring two letters

Scoring two letters of a single alphabet (or two similar alphabets) is performed by scoring schemes. A scoring scheme is any type that models seqan3::scoring_scheme, i.e. it must provide a member function that takes the two letters and returns the scheme-specific score. Algorithms that expect a scoring scheme should check this concept with their respective alphabet(s).

Two generic scoring schemes are provided:

  1. seqan3::nucleotide_scoring_scheme that accepts all nucleotides (and any alphabet that is explicitly convertible to seqan3::dna15)
  2. seqan3::aminoacid_scoring_scheme that accepts all amino acids (and any alphabet that is explicitly convertible to seqan3::aa27).

These also support scoring two nucleotides/amino acids of different types and they also support modification of their scores via set_() functions and by returning references to their internal score matrix. You can however add completely different types, as long as they model seqan3::scoring_scheme.

The scoring scheme can be configured with the seqan3::align_cfg::scoring element. Since the scoring scheme is strongly coupled on the sequences to be aligned it can not be defaulted. Thus, it is mandatory for the the developer to specify the seqan3::align_cfg::scoring configuration.

Scoring gaps

Throughout SeqAn we use the term gap to refer to an individual gap (see Gap) and a gap interval to refer to a stretch of consecutive gaps. When aligning two sequences a gap is introduced to mark an insertion or deletion with respect to the other sequence. However, because it is widely recognised that the likelihood of n consecutive gaps is much higher than that of n individual gaps the scoring of an individual gap or a stretch of gaps is not handled by the scoring scheme. This is based on the assumption that one biological event often introduces more than one gap at a time and that single character gaps are not common due to other biological factors like frame preservation in protein-coding sequences.

SeqAn offers the additional seqan3::gap_scheme which can be used to set the scores for opening or extending a gap.

The gap scheme can be configured with the seqan3::align_cfg::gap element. If the configuration is not specified, the algorithm uses edit distance scores (-1) for deletion/insertion.

Computing banded alignments

SeqAn offers the computation of banded alignments to reduce the running time of the algorithm. This can be helpful if the region in which the optimal alignment exists is known a priori. To specify the banded alignment the developer can use the seqan3::align_cfg::band option initialised with a seqan3::static_band. The seqan3::static_band is constructed with a lower bound and an upper bound. The upper bound must always be greater than or equal to the lower bound. To position the band, imagine a rectangle where the first sequence is written on top and the second sequence along the left vertical side. A negative bound implies a start of the band within the vertical part while a positive bound implies a start within the top part of this rectangle at the respective position.

Global and local alignments

The standard global and local alignments can be configured using the configuration seqan3::align_cfg::mode. The global alignment can be further refined using the seqan3::align_cfg::sequence_ends configuration. This configuration allows to enable, respectively disable the scoring of leading and trailing gaps at the respective sequence ends. This option is not available for the local alignment where scoring gaps at the ends of the sequences is always disabled.

Algorithmic details

By default a generic alignment algorithm is used that supports all valid alignment configurations, but for some special combinations of parameters a notably faster algorithm is available. It is automatically selected if all of the following requirements are satisfied:

There is a special shortcut for the above required configs called seqan3::align_cfg::edit, which can be used to safe some typing.

The edit configuration can be further specialised with following configs:

Note
If there was a configuration that is not suitable for the edit distance algorithm the standard alignment algorithm is executed as a fallback.

Parallel alignment execution

SeqAn's alignment algorithm is internally accelerated using multi-threading. To compute the alignments in parallel one can call seqan3::align_pairwise with a seqan3::par execution policy. This will automatically spawn a thread for every CPU-(core) available and compute the alignments concurrently. By default the algorithm is called with the seqan3::seq execution policy, which means sequential execution of the alignments.

debug_stream.hpp
Provides seqan3::debug_stream and related types.
dna4.hpp
Provides seqan3::dna4, container aliases and string literals.
align_pairwise.hpp
Provides pairwise alignment function.
utility
seqan3::align_cfg::mode
Sets the alignment mode.
Definition: align_config_mode.hpp:94
vector
seqan3::nucleotide_scoring_scheme
A data structure for managing and computing the score of two nucleotides.
Definition: nucleotide_scoring_scheme.hpp:37
nucleotide_scoring_scheme.hpp
Provides seqan3::nucleotide_scoring_scheme.
seqan3::align_pairwise
constexpr auto align_pairwise(sequence_t &&seq, alignment_config_t const &config)
Computes the pairwise alignment for a pair of sequences or a range over sequence pairs.
Definition: align_pairwise.hpp:139
std::tie
T tie(T... args)
seqan3::align_cfg::scoring
Sets the scoring scheme for the alignment algorithm.
Definition: align_config_scoring.hpp:41
all.hpp
Meta-header for the alignment configuration module .
ranges
Adaptations of concepts from the Ranges TS.
seqan3::debug_stream
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:39
seqan3::global_alignment
constexpr detail::global_alignment_type global_alignment
Helper variable to select the global alignment.
Definition: align_config_mode.hpp:54