SeqAn3  3.0.3 The Modern C++ library for sequence analysis.
Minimiser

This tutorial introduces minimisers. Minimisers are a compact representation of a DNA or a RNA sequence that is closely related to, but more efficient than a representation by k-mers. The minimisers are implemented as a view in SeqAn3. For more information about views and how to implement your own view, please see
Ranges and How to write a view .

Difficulty Easy 20 min Ranges, How to write a view

Motivation

A common way to work with sequences is to obtain their k-mers, but even with a large size of k, the number of k-mers is non-trivial, therefore storing k-mers is costly. At the same time, due to the overlap of consecutive k-mers, the information they contain is highly redundant. That is why minimisers come in handy. Minimisers are k-mers, which are a minimal value in a window of a specified size. Minimal could for example mean lexicographically smallest. By storing only these minimal k-mers the storage cost is significantly reduced while maintaining a similar amount of information.

Minimiser Workflow

Because minimisers are minimal k-mers, they depend on a given k-mer size, a given shape (specifying which positions should be considered) and a window size, which has to be greater or equal than the k-mer size. If all these values are given, then the minimisers can be obtained by determining all k-mers in the forward and in the backward strand for one window. Only the lexicographically smallest k-mer in one window is saved, then the window is shifted by one and the procedure is repeated until all windows have been processed. If two consecutive windows share the same minimiser, it is stored only once.

The following figure shows three examples, where a) and b) differ only in their window size, while c) introduces a gapped k-mer. All positions marked with a ’.’ are gaps and all k-mers in lower case come from the reverse strand. As the example shows, k-mer size, shape and window size influence the total amount of minimisers.

Non-lexicographical Minimisers

When sliding the window over the sequence, it might happen that consecutive minimisers differ only slightly. For instance, when a minimiser starts with a repetition of A’s, it is highly likely that the minimiser of the next window will start with a repetition of A’s too, which will then just be one A shorter. This may go on for multiple window shifts, depending on how long the repetition is. Saving these only slightly different minimisers makes no sense because they contain no new information about the underlying sequence. Additionally, sequences with a repetition of A’s will be seen as more similar to each other than they actually are. As Marçais et al. have shown, randomizing the order of the k-mers can solve this problem.

Robust Winnowing

In case there are multiple minimal values within one window, the minimum and therefore the minimiser is ambiguous. We choose the rightmost value as the minimiser of the window, and when shifting the window, the minimiser is only changed if there appears a value that is strictly smaller than the current minimum. This approach is termed robust winnowing by Chirag et al. and is proven to work especially well on repeat regions.

Usage in SeqAn3

The minimisers are implemented in `seqan3::views::minimiser_hash`. The function requires three arguments: The sequence, the shape and the window size. The above-mentioned k-mer size is implicitly given by the size of the shape. That is all the information you need to obtain minimisers with SeqAn3, so let's just dive right into the first assignment!

Assignment 1: Fun with minimisers I

Task: Obtain the minimisers for "CCACGTCGACGGTT" with an ungapped shape of size 4 and a window size of 8.
Solution

using seqan3::operator""_dna4;
int main()
{
std::vector<seqan3::dna4> text{"CCACGTCGACGGTT"_dna4};
// Here a consecutive shape with size 4 (so the k-mer size is 4) and a window size of 8 is used.
// results in: [10322096095657499240, 10322096095657499142, 10322096095657499224]
// representing the k-mers [GTAC, TCGA, GACG]
seqan3::debug_stream << minimisers << '\n';
}
A class that defines which positions of a pattern to hash.
Definition: shape.hpp:60
Provides seqan3::debug_stream and related types.
Provides seqan3::dna4, container aliases and string literals.
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:42
constexpr auto minimiser_hash
Computes minimisers for a range with a given shape, window size and seed.
Definition: minimiser_hash.hpp:183
Provides seqan3::views::minimiser_hash.
A strong type of underlying type uint8_t that represents the ungapped shape size.
Definition: shape.hpp:25
strong_type for the window_size.
Definition: minimiser_hash.hpp:30

If you have completed the assignment, you probably wonder what these large numbers mean. As explained above, the lexicographical ordering is less optimal, therefore, SeqAn3 uses a seed to randomize the order. To do so, SeqAn3 simply XORs the hash values with a random seed (Default: 0x8F3F73B5CF1C9ADE). How would you get back the actual hash values then? Well, you just use XOR again!

using seqan3::operator""_dna4;
int main()
{
std::vector<seqan3::dna4> text{"CCACGTCGACGGTT"_dna4};
// Here a consecutive shape with size 4 (so the k-mer size is 4) and a window size of 8 is used.
// results in: [10322096095657499240, 10322096095657499142, 10322096095657499224]
// representing the k-mers [GTAC, TCGA, GACG]
seqan3::debug_stream << minimisers << '\n';
// Get hash values
uint64_t seed = 0x8F3F73B5CF1C9ADE; // The default seed from minimiser_hash
// Use XOR on all minimiser values
auto hash_values = minimisers | std::views::transform([seed] (uint64_t i) {return i ^ seed; });
seqan3::debug_stream << hash_values << '\n'; // results in: [182, 216, 134]
}
decltype(detail::transform< trait_t >(list_t{})) transform
Apply a transformation trait to every type in the list and return a seqan3::type_list of the results.
Definition: traits.hpp:434

From these hash values, you can obtain the sequence they are representing by transforming the numbers to base 4. (For example, 182 is "2312" in base four and therefore represents "GTCG".)

Now take a closer look at the resulting minimiser sequences. Are they what you would expect? Probably not, these are not the values presenting the minimisers shown in the example above. Why not?

Hint The example above is based on a lexicographical ordering, our first assignment is not.

So, what would we need to change to achieve the results from our example? We need to use a different seed. SeqAn3 makes this simple by allowing us to add another input parameter of type `seqan3::seed`.

Assignment 2: Fun with minimisers II

Task: Reproduce the results from the example above by obtaining the minimisers for "CCACGTCGACGGTT".

To do so, you have to know which seed to use.

Hint Because the randomization is based on an XOR, you have to use a value X which XOR-ed with a hash value results in the same hash value.

Hint 10001 XOR 0 = 10001

Solution

using seqan3::operator""_dna4;
using seqan3::operator""_shape;
int main()
{
std::vector<seqan3::dna4> text{"CCACGTCGACGGTT"_dna4};
// Here a consecutive shape with size 4 (so the k-mer size is 4) and a window size of 4 is used. The seed is set
// to 0, so lexicographical ordering is used.
// results in: [81, 70, 27, 109, 97, 216, 97, 109, 26, 22, 5]
// representing the k-mers [CCAC, CACG, ACGT, CGTC, cgac, TCGA, CGAC, cgtc, ACGG, accg, aacc]
seqan3::debug_stream << example_a << '\n';
// results in: [27, 97, 26, 22, 5] representing the k-mers [ACGT, CGAC, ACGG, accg, aacc]
seqan3::debug_stream << example_b << '\n';
auto example_c = text | seqan3::views::minimiser_hash(0b10101_shape,
// results in: [9, 18, 7, 6] representing the k-mers [A.G.C, C.A.G, a.c.t, a.c.g]
seqan3::debug_stream << example_c << '\n';
}
strong_type for seed.
Definition: minimiser_hash.hpp:24

Ignoring the backward strand

So far, we have always considered the forward and the backward strand of a sequence, but there might be cases where the backward strand should not be considered. If this is the desired behaviour then the `seqan3::views::minimiser` needs to be used. Unlike the `seqan3::views::minimiser_hash`, `seqan3::views::minimiser` does not hash the values for you, so you have to do this yourself. But fear not, `seqan3::views::kmer_hash` makes this really easy for you!

constexpr auto minimiser
Computes minimisers for a range of comparable values. A minimiser is the smallest value in a window.
Definition: minimiser.hpp:586
constexpr auto kmer_hash
Computes hash values for each position of a range via a given shape.
Definition: kmer_hash.hpp:788

This syntax will result in minimisers with k-mer size 4 and a window-length of 8 (5 + 4 - 1). (So, to determine the right window size you always have to determine the window size you would like to use in `seqan3::views::minimiser_hash` and subtract it by the k-mer size and add 1, here: 8 - 4 + 1 = 5.) Got it? Then let's make a simple test, what is the actual window size of the following:

Hint 1 + 4 - 1 = 4. So, k-mer size and window size are equal.

Wait a second, what does it mean, if k-mer size and window size are equal and we are not considering the backward strand?

Hint `seqan3::views::kmer_hash(seqan3::ungapped{4}) | seqan3::views::minimiser(1)` will return the same values as `seqan3::views::kmer_hash`, because if k-mer size and window size is equal, there is only one comparison to determine the minimal k-mer, the comparison between forward and backward strand. Therefore, if the backward strand is not considered, the minimisers are all k-mers from the given sequence.

In order to ensure that this is the desired behaviour, using `seqan3::views::minimiser(1)` is prohibited. Please, use `seqan3::views::kmer_hash` instead.

Last but not least, `seqan3::views::kmer_hash` and `seqan3::views::minimiser` do not have a seed parameter. So, in order to obtain a random ordering, you have to XOR the view yourself. This can be done with the following command:

auto minimisers = text | seqan3::views::kmer_hash(seqan3::ungapped{4})
| std::views::transform([seed] (uint64_t i)
{return i ^ seed;})

Assignment 3: Fun with minimisers III

Task: Repeat assignment 2 but this time do not consider the backward strand.
Solution

using seqan3::operator""_dna4;
using seqan3::operator""_shape;
int main()
{
std::vector<seqan3::dna4> text{"CCACGTCGACGGTT"_dna4};
// This would lead to an static assert error, because the shape.size() equals the window size. (Remember, the input
// parameter for the minimiser view is calculated by: window size - k-mer size + 1, here: 4 - 4 + 1 = 1.)
// Therefore, kmer_hash needs to be used.
/* auto example_a = text | seqan3::views::kmer_hash(seqan3::shape{seqan3::ungapped{4}})
| seqan3::views::minimiser(1); */
// results in: [81, 70, 27, 109, 182, 216, 97, 134, 26, 107, 175]
// representing the k-mers [CCAC, CACG, ACGT, CGTC, GTCG, TCGA, CGAC, GACG, ACGG, CGGT, GGTT]
seqan3::debug_stream << example_a << '\n';
// results in: [27, 97, 26] representing the k-mers [ACGT, CGAC, ACGG]
seqan3::debug_stream << example_b << '\n';
auto example_c = text | seqan3::views::kmer_hash(0b10101_shape)
// results in: [9, 18, 11] representing the k-mers [A.G.C, C.A.G, A.G.T]
seqan3::debug_stream << example_c << '\n';
}
Provides seqan3::views::kmer_hash.
Provides seqan3::views::minimiser.