SeqAn3 3.3.0-rc.1
The Modern C++ library for sequence analysis.
Views

Search related views.

Classes

struct seqan3::seed
 strong_type for seed.

struct seqan3::window_size
 strong_type for the window_size.


Variables

constexpr auto seqan3::views::kmer_hash
 Computes hash values for each position of a range via a given shape.

constexpr auto seqan3::views::minimiser
 Computes minimisers for a range of comparable values. A minimiser is the smallest value in a window.


Alphabet related views

constexpr auto seqan3::views::minimiser_hash
 Computes minimisers for a range with a given shape, window size and seed.


Detailed Description

Search related views.

See also
Search

Variable Documentation

◆ kmer_hash

constexpr auto seqan3::views::kmer_hash
inline constexpr

Computes hash values for each position of a range via a given shape.

Template Parameters
urng_t  The type of the range being processed. See below for requirements. [template parameter is omitted in pipe notation]
Parameters
[in] urange  The range being processed. [parameter is omitted in pipe notation]
[in] shape   The seqan3::shape that determines how to compute the hash value.
Returns
A range of std::size_t where each value is the hash of the respective k-mer. See below for the properties of the returned range.
Attention
For the alphabet size $\sigma$ of the alphabet of urange and the number of 1s $s$ of shape, it must hold that $s \le \frac{64}{\log_2\sigma}$, i.e. hashes resulting from the shape/alphabet combination can be represented in a uint64_t.
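To make the constraint concrete: a shape with $s$ ones over an alphabet of size $\sigma$ produces hashes in the range $[0, \sigma^s - 1]$, so $s$ is admissible exactly when $\sigma^s - 1$ fits into 64 bits. The following is a minimal standalone sketch in plain C++ that computes the largest admissible $s$. It makes no SeqAn3 calls, and `max_shape_count` is a hypothetical helper name chosen for this illustration, not part of the library.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not a SeqAn3 function): the largest number of 1s `s` in a shape
// such that every hash over an alphabet of size `sigma` fits into a std::uint64_t,
// i.e. sigma^s - 1 <= 2^64 - 1.
inline unsigned max_shape_count(std::uint64_t const sigma)
{
    assert(sigma >= 2 && "an alphabet needs at least two symbols");

    constexpr std::uint64_t max_value = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t largest_hash = 0; // equals sigma^s - 1 for the current s
    unsigned s = 0;

    // Adding one more position maps the largest hash to largest_hash * sigma + (sigma - 1).
    // The loop continues only while that product still fits into 64 bits.
    while (largest_hash <= (max_value - (sigma - 1)) / sigma)
    {
        largest_hash = largest_hash * sigma + (sigma - 1);
        ++s;
    }
    return s;
}
```

For a four-letter alphabet such as seqan3::dna4 the function returns 32, which agrees with $\frac{64}{\log_2 4} = 32$. For a 27-letter alphabet it returns 13.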

View properties

Concepts and traits                 urng_t (underlying range type)    rrng_t (returned range type)
std::ranges::input_range            required                          preserved
std::ranges::forward_range          required                          preserved
std::ranges::bidirectional_range                                      preserved
std::ranges::random_access_range                                      preserved
std::ranges::contiguous_range                                         lost
std::ranges::viewable_range         required                          guaranteed
std::ranges::view                                                     guaranteed
std::ranges::sized_range                                              preserved
std::ranges::common_range                                             preserved
std::ranges::output_range                                             lost
seqan3::const_iterable_range                                          preserved
std::ranges::range_reference_t      seqan3::semialphabet              std::size_t

See the views submodule documentation for detailed descriptions of the view properties.

Attention
The Shape is defined from right to left! The mask 0b11111101 applied to "AGAAAATA" is interpreted as "A.AAAATA" (and not "AGAAAA.A") and will return the hash value for "AAAAATA".
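The right-to-left convention can be illustrated without the library. The sketch below is a standalone approximation, not the SeqAn3 implementation: `dna4_rank` and `kmer_hashes_sketch` are hypothetical names, the alphabet is fixed to A, C, G, T with ranks 0 to 3, and the shape is passed as an integer mask plus its length (`span`). It assumes that bit i of the mask, counted from the least significant bit, belongs to position i of the window, which is the reading that reproduces the values documented on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Rank of a DNA4 character (A=0, C=1, G=2, T=3); hypothetical helper for this sketch.
inline std::uint64_t dna4_rank(char const c)
{
    switch (c)
    {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default: assert(false && "not a DNA4 character"); return 0;
    }
}

// Hashes every window of `span` characters. Bit i of `mask` (counted from the least
// significant bit) decides whether window position i contributes to the hash.
// The mask must not contain more than 32 ones, otherwise the hash overflows 64 bits.
inline std::vector<std::uint64_t> kmer_hashes_sketch(std::string_view const text,
                                                     std::uint64_t const mask,
                                                     std::size_t const span)
{
    assert(span >= 1 && span <= 64);

    std::vector<std::uint64_t> hashes;
    for (std::size_t start = 0; start + span <= text.size(); ++start)
    {
        std::uint64_t hash = 0;
        for (std::size_t i = 0; i < span; ++i)
            if ((mask >> i) & 1u) // skip gap positions
                hash = hash * 4 + dna4_rank(text[start + i]);
        hashes.push_back(hash);
    }
    return hashes;
}
```

Under these assumptions, `kmer_hashes_sketch("AGAAAATA", 0b11111101, 8)` skips the G at text position 1 (the 0 is the second bit from the right) and yields the same single hash, 12, as `kmer_hashes_sketch("AAAAATA", 0b1111111, 7)`.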

Example

#include <vector>

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/views/kmer_hash.hpp>

using namespace seqan3::literals;

int main()
{
    std::vector<seqan3::dna4> text{"ACGTAGC"_dna4};

    // Hash all 3-mers of the text with an ungapped shape.
    auto hashes = text | seqan3::views::kmer_hash(seqan3::shape{seqan3::ungapped{3}});
    seqan3::debug_stream << hashes << '\n'; // [6,27,44,50,9]

    seqan3::debug_stream << (text | seqan3::views::kmer_hash(seqan3::ungapped{3})) << '\n'; // [6,27,44,50,9]
    seqan3::debug_stream << (text | seqan3::views::kmer_hash(0b101_shape)) << '\n';         // [2,7,8,14,1]

    // Attention: the Shape is defined from right to left!
    // The mask 0b11111101 applied to "AGAAAATA" ("A.AAAATA") will yield
    // the same hash value as mask 0b1111111 applied to "AAAAATA".
    {
        auto text1 = "AGAAAATA"_dna4;
        auto text2 = "AAAAATA"_dna4;
        seqan3::debug_stream << (text1 | seqan3::views::kmer_hash(0b11111101_shape)) << '\n'; // [12]
        seqan3::debug_stream << (text2 | seqan3::views::kmer_hash(0b1111111_shape)) << '\n';  // [12]
    }
}

This entity is stable. Since version 3.1.

◆ minimiser

constexpr auto seqan3::views::minimiser
inline constexpr

Computes minimisers for a range of comparable values. A minimiser is the smallest value in a window.

Template Parameters
urng_t  The type of the first range being processed. See below for requirements. [template parameter is omitted in pipe notation]
Parameters
[in] urange1      The range being processed. [parameter is omitted in pipe notation]
[in] window_size  The number of values in one window.
Returns
A range of std::totally_ordered where each value is the minimal value for one window. See below for the properties of the returned range.

A minimiser is the smallest value in a window. For example, for the list of hash values [28, 100, 9, 23, 4, 1, 72, 37, 8] and a window_size of 4, the minimiser values are [9, 4, 1].

The minimiser can be calculated for one given range or for two given ranges, where the minimiser is the smallest value in both windows. For example, for the lists of hash values [28, 100, 9, 23, 4, 1, 72, 37, 8] and [30, 2, 11, 101, 199, 73, 34, 900] and a window_size of 4, the minimiser values are [2, 4, 1].

Note that in the interface with the second underlying range the const-iterable property will only be preserved if both underlying ranges are const-iterable.

Robust Winnowing

In case there are multiple minimal values within one window, the minimum and therefore the minimiser is ambiguous. We choose the rightmost value as the minimiser of the window, and when shifting the window, the minimiser is only changed if there appears a value that is strictly smaller than the current minimum. This approach is termed robust winnowing by Chirag et al. and is proven to work especially well on repeat regions.
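The windowing rule described above can be sketched in plain C++. This is a standalone illustration under stated assumptions, not the library's implementation: `window_minimisers_sketch` is a hypothetical name, the input is a plain vector of hash values, and a value is emitted only when the minimiser's position changes. When the current minimiser drops out of the window, the sketch rescans the window and again takes the rightmost minimal value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: minimisers of all windows of `window_size` consecutive values.
// A minimiser is reported once per position, so consecutive windows that share
// the same minimiser do not produce duplicates.
inline std::vector<std::uint64_t> window_minimisers_sketch(std::vector<std::uint64_t> const & values,
                                                           std::size_t const window_size)
{
    assert(window_size >= 1);

    std::vector<std::uint64_t> minimisers;
    if (values.size() < window_size)
        return minimisers;

    std::size_t min_pos = 0;
    for (std::size_t end = window_size; end <= values.size(); ++end)
    {
        std::size_t const begin = end - window_size;
        if (end == window_size || min_pos < begin)
        {
            // First window, or the previous minimiser left the window:
            // rescan and take the rightmost minimal value (`<=` moves right on ties).
            min_pos = begin;
            for (std::size_t i = begin + 1; i < end; ++i)
                if (values[i] <= values[min_pos])
                    min_pos = i;
            minimisers.push_back(values[min_pos]);
        }
        else if (values[end - 1] < values[min_pos])
        {
            // The newly added value is strictly smaller: it becomes the minimiser.
            min_pos = end - 1;
            minimisers.push_back(values[min_pos]);
        }
        // Otherwise the window keeps its minimiser and nothing is emitted.
    }
    return minimisers;
}
```

With the hash values [28, 100, 9, 23, 4, 1, 72, 37, 8] and a window size of 4, the sketch returns [9, 4, 1]: the last four windows all contain the value 1 at the same position, so it is emitted only once.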

Example

#include <vector>

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/alphabet/views/complement.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/views/kmer_hash.hpp>
#include <seqan3/search/views/minimiser.hpp>

using namespace seqan3::literals;

int main()
{
    std::vector<seqan3::dna4> text{"ACGTAGC"_dna4};

    auto hashes = text | seqan3::views::kmer_hash(seqan3::shape{seqan3::ungapped{3}});
    seqan3::debug_stream << hashes << '\n'; // [6,27,44,50,9]

    auto minimiser = hashes | seqan3::views::minimiser(4);
    seqan3::debug_stream << minimiser << '\n'; // [6,9]

    // kmer_hash with gaps, hashes: [2,7,8,14,1], minimiser: [2,1]
    seqan3::debug_stream << (text | seqan3::views::kmer_hash(0b101_shape) | seqan3::views::minimiser(4)) << '\n';

    /* Minimiser view with two ranges
     * The second range defines the hash values from the reverse complement, the second reverse is necessary to put the
     * hash values in the correct order. For the example here:
     * ACGTAGC | seqan3::views::complement                      => TGCATCG
     *         | std::views::reverse                            => GCTACGT
     *         | seqan3::views::kmer_hash(seqan3::ungapped{3})  => [39 (for GCT), 28 (for CTA), 49 (for TAC),
     *                                                              6 (for ACG), 27 (for CGT)]
     * "GCT" is not the reverse complement of the first k-mer in "ACGTAGC", which is "ACG", but "CGT" is.
     * Therefore, a second reverse is necessary to find the smallest value between the original sequence and its
     * reverse complement.
     */
    auto reverse_complement_hashes = text | seqan3::views::complement | std::views::reverse
                                   | seqan3::views::kmer_hash(seqan3::ungapped{3}) | std::views::reverse;
    seqan3::debug_stream << reverse_complement_hashes << '\n'; // [27,6,49,28,39]

    auto minimiser2 = seqan3::detail::minimiser_view{hashes, reverse_complement_hashes, 4};
    seqan3::debug_stream << minimiser2 << '\n'; // [6,6]
}

View properties

Concepts and traits                 urng_t (underlying range type)    rrng_t (returned range type)
std::ranges::input_range            required                          preserved
std::ranges::forward_range          required                          preserved
std::ranges::bidirectional_range                                      lost
std::ranges::random_access_range                                      lost
std::ranges::contiguous_range                                         lost
std::ranges::viewable_range         required                          guaranteed
std::ranges::view                                                     guaranteed
std::ranges::sized_range                                              lost
std::ranges::common_range                                             lost
std::ranges::output_range                                             lost
seqan3::const_iterable_range                                          preserved
std::ranges::range_reference_t      std::totally_ordered              std::totally_ordered

See the views submodule documentation for detailed descriptions of the view properties.

This entity is stable. Since version 3.1.

◆ minimiser_hash

constexpr auto seqan3::views::minimiser_hash
inline constexpr

Computes minimisers for a range with a given shape, window size and seed.

Template Parameters
urng_t  The type of the range being processed. See below for requirements. [template parameter is omitted in pipe notation]
Parameters
[in] urange       The range being processed. [parameter is omitted in pipe notation]
[in] shape        The seqan3::shape that determines how to compute the hash value.
[in] window_size  The window size to use.
[in] seed         The seed used to skew the hash values. Default: 0x8F3F73B5CF1C9ADE.
Returns
A range of std::size_t where each value is the minimiser of the respective window. See below for the properties of the returned range.

A sequence can be represented by a small number of k-mers (minimisers). For a given shape and window size, all k-mers are determined on the forward strand and the backward strand, and only the lexicographically smallest k-mer is returned for each window. This process is repeated over every possible window of a sequence. If consecutive windows share a minimiser, it is saved only once. For example, in the sequence "TAAAGTGCTAAA" with an ungapped shape of length 3 and a window size of 5, the first, the second and the last window contain the same minimiser "AAA". Because the minimisers of the first two consecutive windows also share the same position, storing this minimiser twice would be redundant, so it is stored only once. The "AAA" minimiser of the last window, on the other hand, is stored, because it is located at a different position than the previous "AAA" minimiser; storing it is therefore not redundant but necessary.
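The procedure in the paragraph above can be approximated in plain C++. The following is a sketch under stated assumptions and not the library's implementation: all function names are hypothetical, only ungapped shapes of length k over A, C, G, T are handled, the window size is taken to be measured in characters (so each window holds window_size - k + 1 k-mers), and the seed is applied by XOR to both strands before taking the smaller value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Rank of a DNA4 character (A=0, C=1, G=2, T=3); hypothetical helper for this sketch.
inline std::uint64_t dna4_rank(char const c)
{
    switch (c)
    {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default: assert(false && "not a DNA4 character"); return 0;
    }
}

// For every ungapped k-mer: the smaller of the (seeded) forward and reverse complement hash.
inline std::vector<std::uint64_t> canonical_hashes_sketch(std::string_view const text,
                                                          std::size_t const k,
                                                          std::uint64_t const seed)
{
    std::vector<std::uint64_t> values;
    for (std::size_t start = 0; start + k <= text.size(); ++start)
    {
        std::uint64_t forward = 0;
        std::uint64_t reverse = 0;
        for (std::size_t i = 0; i < k; ++i)
        {
            forward = forward * 4 + dna4_rank(text[start + i]);
            // Reverse complement: read the k-mer backwards and complement each rank (3 - rank).
            reverse = reverse * 4 + (3 - dna4_rank(text[start + k - 1 - i]));
        }
        values.push_back(std::min(forward ^ seed, reverse ^ seed));
    }
    return values;
}

// Minimisers over all windows of `window_size` characters; each minimiser position is reported once.
inline std::vector<std::uint64_t> minimiser_hash_sketch(std::string_view const text,
                                                        std::size_t const k,
                                                        std::size_t const window_size,
                                                        std::uint64_t const seed)
{
    assert(k >= 1 && k <= 32 && window_size >= k);

    std::vector<std::uint64_t> const values = canonical_hashes_sketch(text, k, seed);
    std::size_t const kmers_per_window = window_size - k + 1;

    std::vector<std::uint64_t> minimisers;
    if (values.size() < kmers_per_window)
        return minimisers;

    std::size_t min_pos = 0;
    for (std::size_t end = kmers_per_window; end <= values.size(); ++end)
    {
        std::size_t const begin = end - kmers_per_window;
        if (end == kmers_per_window || min_pos < begin)
        {
            // First window, or the old minimiser left the window: rescan, rightmost minimum wins.
            min_pos = begin;
            for (std::size_t i = begin + 1; i < end; ++i)
                if (values[i] <= values[min_pos])
                    min_pos = i;
            minimisers.push_back(values[min_pos]);
        }
        else if (values[end - 1] < values[min_pos])
        {
            // The newly added k-mer is strictly smaller and becomes the minimiser.
            min_pos = end - 1;
            minimisers.push_back(values[min_pos]);
        }
    }
    return minimisers;
}
```

With the text "CCACGTCGACGGTT", k = 4, a window size of 8 and a seed of 0, the sketch returns [27, 97, 26, 22, 5], the values given in the example on this page for the same parameters. For "TAAAGTGCTAAA" with k = 3 and a window size of 5 it returns a list that starts and ends with 0 (the hash of "AAA"), in line with the description above.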

Non-lexicographical Minimisers by skewing the hash value with a seed

It might happen that a minimiser changes only slightly when sliding the window over the sequence. For instance, when a minimiser starts with a repetition of A’s, it is highly likely that the minimiser of the next window will start with a repetition of A’s as well. Because it is only one A shorter, this might go on for multiple window shifts, depending on how long the repetition is. Saving these only slightly different minimisers makes no sense because they contain no new information about the underlying sequence. Additionally, sequences with a repetition of A’s will be seen as more similar to each other than they actually are. As Marçais et al. have shown, randomizing the order of the k-mers can solve this problem. Therefore, a random seed is XORed with all k-mers, thereby randomizing the order. The user can change the seed to any other value they consider useful. A seed of 0 yields the lexicographical order.
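The effect of the seed on the ordering can be seen in a small standalone sketch. It assumes only what is stated above, namely that each hash value is XORed with the seed before values are compared. `argmin_with_seed` is a hypothetical helper, and the library may preprocess the seed in ways this sketch does not model.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: position of the smallest value after XORing every hash with `seed`.
inline std::size_t argmin_with_seed(std::vector<std::uint64_t> const & hashes, std::uint64_t const seed)
{
    assert(!hashes.empty());

    // Compare the seeded values instead of the raw hashes.
    auto const smaller = [seed](std::uint64_t const a, std::uint64_t const b)
    {
        return (a ^ seed) < (b ^ seed);
    };
    return static_cast<std::size_t>(std::min_element(hashes.begin(), hashes.end(), smaller) - hashes.begin());
}
```

For the hash values [6, 27, 44, 50, 9], a seed of 0 selects position 0 (the value 6, which is the lexicographically smallest k-mer), whereas the default seed 0x8F3F73B5CF1C9ADE selects position 1 (the value 27), because XOR with the seed reorders the values.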

See also
seqan3::views::minimiser_view
Attention
Be aware of the requirements of the seqan3::views::kmer_hash view.

This entity is experimental and subject to change in the future.

View properties

Concepts and traits                 urng_t (underlying range type)    rrng_t (returned range type)
std::ranges::input_range            required                          preserved
std::ranges::forward_range          required                          preserved
std::ranges::bidirectional_range                                      lost
std::ranges::random_access_range                                      lost
std::ranges::contiguous_range                                         lost
std::ranges::viewable_range         required                          guaranteed
std::ranges::view                                                     guaranteed
std::ranges::sized_range                                              lost
std::ranges::common_range                                             lost
std::ranges::output_range                                             lost
seqan3::const_iterable_range                                          preserved
std::ranges::range_reference_t      seqan3::semialphabet              std::size_t

See the views submodule documentation for detailed descriptions of the view properties.

Example

#include <vector>

#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/views/minimiser_hash.hpp>

using namespace seqan3::literals;

int main()
{
    std::vector<seqan3::dna4> text{"CCACGTCGACGGTT"_dna4};

    // Here a consecutive shape with size 4 (so the k-mer size is 4) and a window size of 8 is used. The seed is set
    // to 0, so lexicographical ordering is used for demonstration purposes.
    auto minimisers =
        text
        | seqan3::views::minimiser_hash(seqan3::shape{seqan3::ungapped{4}}, seqan3::window_size{8}, seqan3::seed{0});
    seqan3::debug_stream << minimisers << '\n';
    // This leads to [27,97,26,22,5] representing the k-mers [ACGT, CGAC, ACGG, accg, aacc]; lower-case k-mers
    // come from the reverse strand.

    // Here a gapped shape with size 5 (and a k-mer size of 3) and a window size of 8 is used. The seed is set
    // to 0, so lexicographical ordering is used for demonstration purposes.
    auto minimisers2 = text | seqan3::views::minimiser_hash(0b10101_shape, seqan3::window_size{8}, seqan3::seed{0});
    seqan3::debug_stream << minimisers2 << '\n';
    // This leads to [9,18,7,6] representing the k-mers [A.G.C, C.A.G, a.c.t, a.c.g]
}

This entity is experimental and subject to change in the future. Experimental since version 3.1.