SeqAn3 3.2.0
The Modern C++ library for sequence analysis.
Pairwise

Provides the algorithmic components for the computation of pairwise alignments. More...

+ Collaboration diagram for Pairwise:

Classes

class  seqan3::alignment_result< alignment_result_value_t >
 Stores the alignment results and gives access to score, alignment and the front and end positionss. More...
 

Functions

template<typename sequence_t , typename alignment_config_t >
requires detail::align_pairwise_single_input<sequence_t> && std::copy_constructible<std::remove_reference_t<sequence_t>> && detail::is_type_specialisation_of_v<alignment_config_t, configuration>
constexpr auto seqan3::align_pairwise (sequence_t &&seq, alignment_config_t const &config)
 Computes the pairwise alignment for a pair of sequences or a range over sequence pairs. More...
 

Detailed Description

Provides the algorithmic components for the computation of pairwise alignments.

Introduction to pairwise alignment

Pairwise sequence alignments can be computed with the free function seqan3::align_pairwise. This function is called in the default case with a sequence pair and an alignment configuration object. Note the type of the pair must model seqan3::tuple_like, e.g. std::tuple or std::pair, with exactly two elements. The algorithm accesses the sequences via the corresponding get interface. Furthermore, the sequences stored in the pair must model std::ranges::viewable_range. This means the type returned by the get interface models either std::ranges::view or std::ranges::range and is an lvalue reference. If you don't know yet what a view or a range is, it is recommended to read through the ranges tutorial. The following code snippet demonstrates a simple use of the pairwise alignment interface.

#include <utility>
int main()
{
using namespace seqan3::literals;
seqan3::dna4_vector s1 = "ACGTGAACTGACT"_dna4;
seqan3::dna4_vector s2 = "ACGAAGACCGAT"_dna4;
// Configure the alignment kernel.
auto config =
// Invoke the pairwise alignment which returns a lazy range over alignment results.
auto results = seqan3::align_pairwise(std::tie(s1, s2), config);
auto & res = *results.begin();
seqan3::debug_stream << "Score: " << res.score() << '\n';
}
Provides pairwise alignment function.
Sets the global alignment method.
Definition: align_config_method.hpp:122
Sets the scoring scheme for the alignment algorithm.
Definition: align_config_scoring_scheme.hpp:45
A data structure for managing and computing the score of two nucleotides.
Definition: nucleotide_scoring_scheme.hpp:38
Provides seqan3::debug_stream and related types.
Provides seqan3::dna4, container aliases and string literals.
constexpr auto align_pairwise(sequence_t &&seq, alignment_config_t const &config)
Computes the pairwise alignment for a pair of sequences or a range over sequence pairs.
Definition: align_pairwise.hpp:134
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:37
The SeqAn namespace for literals.
Provides seqan3::nucleotide_scoring_scheme.
T tie(T... args)

In this snippet a global alignment over two nucleotide sequences is computed. Here the helper function std::tie is used to pass the two sequences as a tuple to the alignment algorithm. The special interface of std::tie allows to forward the two sequences as lvalue references such that no copy of the data is involved.

There are a lot of applications that need to compute many pairwise sequence alignments. Accordingly, the seqan3::align_pairwise interface offers an overload for ranges over sequence pairs. The following snippet shows a simple use case.

#include <ranges>
#include <vector>
int main()
{
using namespace seqan3::literals;
std::vector data1{"AGTGCTACG"_dna4, "AGTAGACTACG"_dna4, "AGTTACGAC"_dna4};
std::vector data2{"ACGTGCGACTAG"_dna4, "ACGTACGACACG"_dna4, "AGTAGCGATCG"_dna4};
// Configure the alignment kernel.
auto config =
// Compute the alignment over a range of pairs.
for (auto const & res : seqan3::align_pairwise(seqan3::views::zip(data1, data2), config))
seqan3::debug_stream << "The score: " << res.score() << "\n";
}
Meta-header for the Alignment / Configuration submodule .
constexpr auto zip
A view adaptor that takes several views and returns tuple-like values from every i-th element of each...
Definition: zip.hpp:573
Attention
In addition to the type requirements above the alignment interface requires that the passed sequences model std::ranges::random_access_range and std::ranges::sized_range in order to work correctly, e.g. a std::vector.

Configuring pairwise alignments

In SeqAn the alignment algorithm can be configured in many different ways. The core of this configuration are the different configuration elements that select specific features of the algorithm. To allow a maximal flexibility the configuration is separated from the alignment interface. This means the algorithm must be configured before it is invoked. The respective alignment configurations are defined in their own namespace called seqan3::align_cfg. This namespace is used to disambiguate configurations for the alignment algorithm with configurations from other algorithms in SeqAn.

To compute a pairwise alignment at least two configuration elements must be provided: The alignment method and the scoring scheme. Thus, a valid alignment configuration must specify what kind of alignment shall be computed, as it strongly depends on the corresponding context. A default wouldn't make much sense here. For similar reasons the scoring scheme has to be always provided by the user.

Global and local alignments

There have been numerous adaptions and modifications of the original global alignment problem to solve similar problems such as the local alignment. Here, the goal is to find a maximal homologue region between two sequences that has been conserved during the evolution. Other examples are the semi-global alignment which is frequently used in read mapping in order to align a smaller sequence into the context of a larger reference sequence.

The standard global and local alignments can be configured using seqan3::align_cfg::method_global and seqan3::align_cfg::method_local, respectively.

A semi-global alignment can be computed by specifying the free end-gaps in the constructor of seqan3::align_cfg::method_global. The parameters enable, respectively disable, the scoring of leading and trailing gaps at the respective sequence ends ( first sequence leading, second sequence leading or first sequence trailing, second sequence trailing gaps). The SeqAn alignment algorithm allows any free end-gap settings making it a very versatile algorithm. This option, however, is not available for the local alignment where penalising gaps at the ends of the sequences is always disabled.

Global Alignment:

--T--CC-C-AGT--TATGT-CAGGGGACACG-A-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG-T-CAGAT--C

Finding the optimal global alignment of two sequences is solved by the Needleman-Wunsch algorithm.
Semi-global Alignment (e.g. overlapping sequences):

TCCCAGTTATGTCAGgggacacgagcatgcagagac
|||||||||||||||
aattgccgccgtcgttttTCCCAGTTATGTCAG

The semi-global alignment is a specially configured global alignment, namely we do not penalize gaps at the ends of the alignment. Semi-global alignments are often used in genome assembly applications when trying to find matching overlaps.

Local Alignment (better suited to find conserved segments):

tccCAGTTATGTCAGgggacacgagcatgcagagac
||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

A local alignment is effectively a global alignment of two partial sequences. For example when two genes from different species are similar in short conserved regions and dissimilar in the remaining regions. A global alignment would not find the local matching because it would try to align the entire sequence. This is solved by the Smith-Waterman algorithm.

Using scoring and gap schemes

To compute an alignment a scoring and a gap scheme must be provided which give a "score" for substituting, inserting, or deleting a base within the alignment computation. Throughout SeqAn, a positive score implies higher similarity and/or a closer relatedness and a lower or even negative score implies distance. If you are used to dealing with "penalties" or "distances", instead think of "negative scores" when using SeqAn interfaces.

Scoring two letters

Scoring two letters of a single alphabet (or two similar alphabets) is performed by scoring schemes. A scoring scheme is any type that models seqan3::scoring_scheme_for, i.e. it must provide a member function that takes the two letters and returns the scheme-specific score. Algorithms that expect a scoring scheme should check this concept with their respective alphabet(s).

Two generic scoring schemes are available in SeqAn:

  1. seqan3::nucleotide_scoring_scheme that accepts all nucleotides (and any alphabet that is explicitly convertible to seqan3::dna15)
  2. seqan3::aminoacid_scoring_scheme that accepts all amino acids (and any alphabet that is explicitly convertible to seqan3::aa27).

These also support scoring two nucleotides/amino acids of different types and they also support modification of their scores via set_() functions and by returning references to their internal score matrix. You can however add completely different types, as long as they model seqan3::scoring_scheme_for for the respective sequence alphabet type. In fact, when invoking the seqan3::align_pairwise interface it will be checked at compile time if the provided scoring scheme can be used in combination with the passed sequences and if not a static assertion is raised.

The scoring scheme can be configured with the seqan3::align_cfg::scoring_scheme element. Since the scoring scheme is strongly coupled on the sequences to be aligned, there is no default for it. Thus, it is mandatory for the developer to specify this configuration.

Scoring gaps

Throughout SeqAn we use the term gap to refer to an individual gap (see Gap) and a gap interval to refer to a stretch of consecutive gaps. When aligning two sequences a gap is introduced to mark an insertion or deletion with respect to the other sequence. However, because it is widely recognised that the likelihood of n consecutive gaps is much higher than that of n individual gaps the scoring of an individual gap or a stretch of gaps is not handled by the scoring scheme. This is based on the assumption that one biological event often introduces more than one gap at a time and that single character gaps are not common due to other biological factors like frame preservation in protein-coding sequences.

SeqAn offers the additional seqan3::gap_cost_affine which can be used to set the scores for opening (seqan3::align_cfg::open_score) or extending a gap (seqan3::align_cfg::extension_score).

Combining configuration elements

Configurations can be combined using the |-operator. If a combination is invalid, a static assertion is raised during the compilation of the program. It will inform the user that some configurations cannot be combined together into one alignment configuration. In general, the same configuration element cannot occur more than once inside of a configuration specification. The following table shows which combinations are possible.

Config 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0: Band
1: Gap scheme affine
2: Min score
3: Method global
4: Method local
5: Alignment output
6: End positions output
7: Begin positions output
8: Score output
9: Sequence1 id output
10: Sequence2 id output
11: Parallel
12: Score type
13: Scoring scheme
14: Vectorised

Accessing the alignment results

The seqan3::align_pairwise interface returns a seqan3::algorithm_result_generator_range. This range is a lazy single pass input range over the computed alignments and the range's element types are seqan3::alignment_result objects. Even if only a single alignment is computed a range will be returned since it could be possible that one alignment invocation produces multiple results, e.g. to receive suboptimal alignments. The seqan3::alignment_result object contains only the information that has been requested via the output configuration (see below). The algorithm will then choose the most efficient implementation to compute the requested outputs. The following table shows the different output configurations:

Output option Available result
seqan3::align_cfg::output_score alignment score
seqan3::align_cfg::output_end_position end positions of the aligned sequences
seqan3::align_cfg::output_begin_position begin positions of the aligned sequences
seqan3::align_cfg::output_alignment alignment of the two sequences
seqan3::align_cfg::output_sequence1_id id of the first sequence
seqan3::align_cfg::output_sequence2_id id of the second sequence

The begin and end positions refer to the begin and end positions of the slices of the original sequences that are aligned. For example, the positions reported for the global alignment correspond to the positions of the original sequences since the entire sequences are encompassed by the global alignment. In case of a local alignment the aligned part might only encompass a part of the original sequences. In this case, the begin and end positions denote the begin and end of the slices of the original sequences that are aligned. To obtain the actual alignment the option seqan3::align_cfg::output_alignment has to be specified. The options can be combiend with each other in order to customise the alignment algorithm and the respective output of the alignment. For example computing the alignment will always incur some run time penalty compared to just computing the score. See the following snippet for some examples:

#include <vector>
int main()
{
using namespace seqan3::literals;
// Basic alignment algorithm configuration.
std::pair p{"ACGTAGC"_dna4, "AGTACGACG"_dna4};
// Compute only the score:
seqan3::debug_stream << res << "\n"; // prints: {score: -4}
// Compute only the alignment:
seqan3::debug_stream << res << "\n"; // prints: {alignment: (ACGTA-G-C-,A-GTACGACG)}
// Compute the score and the alignment:
for (auto res :
seqan3::debug_stream << res << "\n"; // prints: {score: -4, alignment: (ACGTA-G-C-,A-GTACGACG)}
// By default compute everything:
for (auto res : seqan3::align_pairwise(p, config))
<< res << "\n"; // prints {id: 0, score: -4, begin: (0,0), end: (7,9) alignment: (ACGTA-G-C-,A-GTACGACG)}
}
Provides seqan3::align_cfg::edit_scheme.
Configures the alignment result to output the alignment.
Definition: align_config_output.hpp:171
Configures the alignment result to output the score.
Definition: align_config_output.hpp:43
constexpr configuration edit_scheme
Shortcut for edit distance configuration.
Definition: align_config_edit.hpp:51

If none of the above configuration was set by the user, then all output options will be enabled by default, i.e. the alignment algorithm will compute every output. Otherwise, if any of the output configurations was set by the user, then only the configured ones are available in the final seqan3::alignment_result. Trying to access an output which has not been configured will raise a static assertion informing the developer about the invalid access.

Note
Currently, the sequence ids are represented by an internal mechanism and might not refer to the actual id of the underlying sequences in the respective alignment, rather it is an ongoing number identifying the computed pair of sequences. In the future, there will be a mechanism for the user to specify the id of the sequences.

Algorithmic details

Since both algorithms (Smith-Waterman and Needleman-Wunsch algorithm) are based on dynamic programming, they run in quadratic time and memory O(nm) (where n and m are the lengths of the respective aligned sequences).

To reduce the time complexity you can use a banded alignment. It reduces the runtime by a constant although remaining quadratic and limiting the possible solutions slightly. You can speed up the computation significantly if you parallelize and simdify your alignment. More about banded and parallelization can be read below.

By default a generic alignment algorithm is used that supports all valid alignment configurations but for some special combinations of parameters a notably faster algorithm is available. It is automatically selected if all of the following requirements are satisfied:

There is a special shortcut for the above required scoring/gap configurations called seqan3::align_cfg::edit_scheme, which can be used to safe some typing.

The edit configuration can be further specialised with following configs:

Note
If there was a configuration that is not suitable for the edit distance algorithm the standard alignment algorithm is executed as a fallback.

Computing banded alignments

SeqAn offers the computation of banded alignments to reduce the running time of the algorithm. This can be helpful if the region in which the optimal alignment exists is known a priori. To specify the banded alignment the developer can use the seqan3::align_cfg::band_fixed_size option.
This band configuration is initialised with a seqan3::align_cfg::lower_diagonal and a seqan3::align_cfg::upper_diagonal. The term diagonal is used to describe the position of the band boundary within the alignment matrix. The given value represents the offset that the lower, respectively upper, diagonal is shifted from the main diagonal, which starts in the origin of the alignment matrix. Accordingly, a negative value shifts the band boundary downwards in the alignment matrix and a positive value shifts the band boundary to the right.

The band parameters might be restricted depending on the configured alignment algorithm, e.g. the origin of the alignment matrix and the sink (the last cell in the last column) must be covered by the band when a global alignment is ought to be computed.
In general, the upper diagonal must always be greater than or equal to the lower diagonal to specify a valid band.

Parallel alignment execution

SeqAn's pairwise sequence alignment algorithm is internally accelerated using multi-threading. The parallel execution can be selected by specifying the seqan3::align_cfg::parallel configuration element. This will enable the asynchronous execution of the alignments in the backend. For the user interface nothing changes as the returned seqan3::algorithm_result_generator_range will preserve the order of the computed alignment results, i.e. the first result corresponds to the first alignment given by the input range. By default, a thread pool with std::thread::hardware_concurrency many threads will be created on a call to seqan3::align_pairwise and destructed when all alignments have been processed and the seqan3::algorithm_result_generator_range goes out of scope. The configuration element seqan3::align_cfg::parallel can be initialised with a custom thread count which determines the number of threads that will be spawned in the background.
Note that only independent alignment computations can be executed in parallel, i.e. you use this method when computing a batch of alignments rather than executing them separately.
Depending on your processor architecture you can gain a significant speed-up.

User callback

In some cases, for example when executing the alignments in parallel, it can be beneficial for the performance to use a continuation interface rather than collecting the results first through the seqan3::algorithm_result_generator_range. To be more precise, if more work needs to be done after the alignment has been computed, it could be better to stay within the thread and continue the work rather than buffering the result and computing the next alignment. The alignment algorithm allows the user to specify their own callback function which will be invoked by the alignment algorithm when a seqan3::alignment_result has been computed. To do so, the seqan3::align_cfg::on_result configuration element can be used during the alignment configuration. Note that if seqan3::align_cfg::on_result is specified, the algorithm seqan3::align_pairwise does not return a seqan3::algorithm_result_generator_range anymore. In fact, the algorithm's return type is void. The following code snippet illustrates this behavior:

#include <mutex>
#include <vector>
int main()
{
// Generate some sequences.
using namespace seqan3::literals;
std::vector<sequence_pair_t> sequences{100, {"AGTGCTACG"_dna4, "ACGTGCGACTAG"_dna4}};
// Use edit distance with 4 threads.
auto const alignment_config =
// Compute the alignments in parallel and output them in order based on the input.
for (auto && result : seqan3::align_pairwise(sequences, alignment_config))
seqan3::debug_stream << result << '\n';
// prints:
// [id: 0 score: -4]
// [id: 1 score: -4]
// [id: 2 score: -4]
// [id: 3 score: -4]
// [id: 4 score: -4]
// [id: 5 score: -4]
// ...
// [id: 98 score: -4]
// [id: 99 score: -4]
// Compute the alignments in parallel and output them unordered using the callback (order is not deterministic).
std::mutex write_to_debug_stream{}; // Need mutex to synchronise the output.
auto const alignment_config_with_callback =
alignment_config
| seqan3::align_cfg::on_result{[&](auto && result)
{
std::lock_guard sync{write_to_debug_stream}; // critical section
seqan3::debug_stream << result << '\n';
}};
seqan3::align_pairwise(sequences, alignment_config_with_callback); // seqan3::align_pairwise is now declared void.
// might print:
// [id: 0 score: -4]
// [id: 1 score: -4]
// [id: 2 score: -4]
// [id: 6 score: -4]
// [id: 7 score: -4]
// [id: 3 score: -4]
// ...
// [id: 99 score: -4]
// [id: 92 score: -4]
}
Configuration element to provide a user defined callback function for the alignment.
Definition: align_config_on_result.hpp:54
seqan3::detail::parallel_mode< std::integral_constant< seqan3::detail::align_config_id, seqan3::detail::align_config_id::parallel > > parallel
Enables the parallel execution of the alignment algorithm if possible for the given configuration.
Definition: align_config_parallel.hpp:38
See also

Function Documentation

◆ align_pairwise()

template<typename sequence_t , typename alignment_config_t >
requires detail::align_pairwise_single_input<sequence_t> && std::copy_constructible<std::remove_reference_t<sequence_t>> && detail::is_type_specialisation_of_v<alignment_config_t, configuration>
constexpr auto seqan3::align_pairwise ( sequence_t &&  seq,
alignment_config_t const &  config 
)
constexpr

Computes the pairwise alignment for a pair of sequences or a range over sequence pairs.

Template Parameters
sequence_tThe type of sequence pairs (see details for more information on the type constraints).
alignment_config_tThe type of the alignment configuration; must be a seqan3::configuration.
Parameters
[in]seqA tuple with two sequences or a range over such tuples.
[in]configThe object storing the alignment configuration.
Returns
A seqan3::algorithm_result_generator_range.

This function computes the pairwise alignment for the given sequences. During the setup phase the most efficient implementation is selected depending on the configurations stored in the given seqan3::configuration object. The configuration also holds settings for parallel or vectorised execution.

Compute a single alignment

In cases where only a single alignment is to be computed, the two sequences can be passed as a pair. The pair can be any class template that models the seqan3::tuple_like concept. The tuple elements must model std::ranges::viewable_range and std::copy_constructible. The following example demonstrates how an alignment is computed from a std::pair of sequences.

std::pair p{"ACGTAGC"_dna4, "AGTACGACG"_dna4};
auto result = seqan3::align_pairwise(p, config);

Alternatively, you can use std::tie as shown in the example below:

std::vector vec{"ACCA"_dna4, "ATTA"_dna4};
auto result = seqan3::align_pairwise(std::tie(vec[0], vec[1]), config);

Compute multiple alignments

In many cases one needs to compute multiple pairwise alignments. Accordingly, the align_pairwise interface allows to pass a range over sequence pairs. The alignment algorithm will be configured only once for all submitted alignments and then computes the alignments sequentially or in parallel depending on the given configuration. Since there is always a certain amount of initial setup involving runtime checks required, it is advisable to pass many sequence-pairs to this algorithm instead of repeatedly calling it with a single pair.

Accessing the alignment results

For each sequence pair one or more seqan3::alignment_results can be computed. The seqan3::align_pairwise function returns an seqan3::algorithm_result_generator_range which can be used to iterate over the alignments. If the vectorised configurations are omitted the alignments are computed on-demand when iterating over the results. In case of a parallel execution all alignments are computed at once in parallel when calling begin on the associated seqan3::algorithm_result_generator_range.

The following snippets demonstrate the single element and the range based interface.

std::vector vec{std::pair{"AGTGCTACG"_dna4, "ACGTGCGACTAG"_dna4},
std::pair{"AGTAGACTACG"_dna4, "ACGTACGACACG"_dna4},
std::pair{"AGTTACGAC"_dna4, "AGTAGCGATCG"_dna4}};
// Compute the alignment of a single pair.
for (auto const & res : seqan3::align_pairwise(std::tie(vec[0].first, vec[0].second), edit_config))
seqan3::debug_stream << "The score: " << res.score() << "\n";
// Compute the alignment over a range of pairs.
for (auto const & res : seqan3::align_pairwise(vec, edit_config))
seqan3::debug_stream << "The score: " << res.score() << "\n";
}

Exception

Strong exception guarantee.

Might throw std::bad_alloc if it fails to allocate the alignment matrix or seqan3::invalid_alignment_configuration if the configuration is invalid. Throws std::runtime_error if seqan3::align_cfg::parallel has been specified without a thread_count value.

Complexity

The complexity depends on the configured algorithm. For the edit distance the following worst case over two input sequences of size $ N $ can be assumed:

Computing Runtime Space
score $ O(N^2/w) $ $ O(w) $
back coordinate $ O(N^2/w) $ $ O(w) $
front coordinate $ O(N^2/w) $ $ O(N^2/w) $
alignment $ O(N^2/w) $ $ O(N^2/w) $

$ w $ is the size of a machine word.

For all other algorithms that compute the standard dynamic programming algorithm the following worst case holds:

Computing Runtime Space
score $ O(N^2) $ $ O(N) $
back coordinate $ O(N^2) $ $ O(N) $
front coordinate $ O(N^2) $ $ O(N^2) $
alignment $ O(N^2) $ $ O(N^2) $

In the banded case the worst case is modified as follows:

Computing Runtime Space
score $ O(N*k) $ $ O(k) $
back coordinate $ O(N*k) $ $ O(k) $
front coordinate $ O(N*k) $ $ O(N*k) $
alignment $ O(N*k) $ $ O(N*k) $

$ k $ is the size of the band.

Thread safety

This function is re-entrant, i.e. it is always safe to call in parallel with different inputs. It is thread-safe, i.e. it is safe to call in parallel with the same input under the condition that the input sequences do not change when being iterated over.