SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
Loading...
Searching...
No Matches
FM Index

Provides seqan3::fm_index and seqan3::bi_fm_index as well as respective cursors. More...

+ Collaboration diagram for FM Index:

Classes

class  seqan3::bi_fm_index< alphabet_t, text_layout_mode_, sdsl_index_type_ >
 The SeqAn Bidirectional FM Index. More...
 
class  seqan3::bi_fm_index_cursor< index_t >
 The SeqAn Bidirectional FM Index Cursor. More...
 
class  seqan3::fm_index< alphabet_t, text_layout_mode_, sdsl_index_type_ >
 The SeqAn FM Index. More...
 
class  seqan3::fm_index_cursor< index_t >
 The SeqAn FM Index Cursor. More...
 
struct  seqan3::detail::fm_index_cursor_node< index_t >
 Internal representation of the node of an FM index cursor. More...
 
struct  seqan3::detail::fm_index_validator
 Class used to validate the requirements on the input text of the fm_index. More...
 
class  seqan3::detail::reverse_fm_index< alphabet_t, text_layout_mode, sdsl_index_type >
 An FM Index specialisation that handles reversing the given text. More...
 
interface  sdsl_index
 Concept for SDSL FM indices (which are called compressed suffix arrays in the SDSL). More...
 
struct  seqan3::suffix_array_interval
 The underlying suffix array interval. More...
 

Typedefs

using seqan3::default_sdsl_index_type = sdsl_wt_index_type
 The default FM Index Configuration.
 
using seqan3::sdsl_wt_index_type = sdsl::csa_wt< sdsl::wt_blcd< sdsl::bit_vector, sdsl::rank_support_v<>, sdsl::select_support_scan<>, sdsl::select_support_scan< 0 > >, 16, 10 '000 '000, sdsl::sa_order_sa_sampling<>, sdsl::isa_sampling<>, sdsl::plain_byte_alphabet >
 The FM Index Configuration using a Wavelet Tree.
 

Enumerations

enum  seqan3::text_layout : bool { seqan3::single , seqan3::collection }
 The possible text layouts (single, collection) the seqan3::fm_index and seqan3::bi_fm_index can support. More...
 

Detailed Description

Provides seqan3::fm_index and seqan3::bi_fm_index as well as respective cursors.

FM Indices

FM indices are text indices similar to suffix trees or suffix arrays which are based on the Burrow Wheeler transform and a sampled suffix array. FM indices are significantly smaller in space without sacrificing speed. They also allow for adjusting the speed respectively space by choosing the underlying data structures accordingly or modyfing the sampling rate of the sampled suffix array.

The FM indices are based on the SDSL 3 (succinct data structure library). You are able to specify the underlying implementation of the SDSL to adjust it to your needs as well as choose one of the preconfigured indices that are suitable for common applications in sequence analysis.

For technical reasons you can currently only build indices over a seqan3::alphabet if its seqan3::alphabet_size is smaller or equal 256.

You can choose between unidirectional and bidirectional FM indices (which can be thought of suffix trees and affix trees, i.e. a combination of suffix and prefix trees being able to search a pattern from left to right, right to left and character by character in any arbitrary order). Roughly speaking bidirectional FM indices are more powerful for approximate string matching at the cost of a higher space consumption (between a factor of 5 and 9 of the input size depending on the configuration).

FM Index Cursors

Index Cursors are lightweight objects, i.e. they are cheap to copy.

Note that the SeqAn index cursor, although having similar behaviour, doesn't model any of the standard library iterator concepts, not even std::input_or_output_iterator.

Typedef Documentation

◆ default_sdsl_index_type

The default FM Index Configuration.

Attention
The default might be changed in a future release. If you rely on a stable API and on-disk-format, please hard-code your sdsl_index_type to a concrete type.

◆ sdsl_wt_index_type

using seqan3::sdsl_wt_index_type = typedef sdsl::csa_wt<sdsl::wt_blcd<sdsl::bit_vector, sdsl::rank_support_v<>, sdsl::select_support_scan<>, sdsl::select_support_scan<0> >, 16, 10'000'000, sdsl::sa_order_sa_sampling<>, sdsl::isa_sampling<>, sdsl::plain_byte_alphabet>

The FM Index Configuration using a Wavelet Tree.

Running time / Space consumption

\(SAMPLING\_RATE = 16\)
\(\Sigma\): alphabet_size<alphabet_type> where alphabet_type is the seqan3 alphabet type (e.g. seqan3::dna4 has an alphabet size of 4).

For an index over a text collection a delimiter is added in between the texts. This causes sigma to increase by 1.

Attention
For any alphabet, the symbol with rank 255 is not allowed to occur in the text. Additionally, rank 254 cannot occur when indexing text collections.

Rank 255 is generally not allowed. When constructing the index, we increase the rank of every letter by 1, since the SDSL uses 0 as a sentinel. Incrementing 255 by 1 will cause the rank (uint8_t) to overflow to 0. For text collections, 255 is reserved as delimiter between the individual texts. Therefore the letter with rank 254 cannot occur in the text.

This index will only work for byte alphabets, i.e. alphabets with a size <= 256. When switching to bigger alphabets, the static_asserts, the delimiter choice and the transform view for the construction need to be adjusted. Additionally, all occurrences of uint8_t should be double checked to make sure they also apply to bigger alphabets.

\(T_{BACKWARD\_SEARCH}: O(\log \Sigma)\)

Todo:
Asymptotic space consumption:

Enumeration Type Documentation

◆ text_layout

enum seqan3::text_layout : bool

The possible text layouts (single, collection) the seqan3::fm_index and seqan3::bi_fm_index can support.

Enumerator
single 

The text is a single range.

collection 

The text is a range of ranges.

Hide me