SeqAn3 3.4.0-rc.1
The Modern C++ library for sequence analysis.
|
Data structures and approximate string search algorithms for large collection of text (e.g. DNA). More...
Modules | |
Configuration | |
Data structures and utility functions for configuring search algorithm. | |
DREAM Index | |
Provides seqan3::interleaved_bloom_filter. | |
FM Index | |
Provides seqan3::fm_index and seqan3::bi_fm_index as well as respective cursors. | |
k-mer Index | |
Implementation of shapes for a k-mer Index. | |
Views | |
Search related views. | |
Classes | |
struct | seqan3::detail::policy_max_error |
Provides the function max_error_counts if inherited by a search algorithm. More... | |
struct | seqan3::detail::policy_search_result_builder< search_configuration_t > |
Provides the function make_results if inherited by a search algorithm. More... | |
struct | seqan3::detail::search< nbr_blocks > |
Object storing information for a search (of a search scheme). More... | |
struct | seqan3::detail::search_configuration_validator |
Class used to validate the search configuration. More... | |
class | seqan3::detail::search_configurator |
Class used to update the search configuration, e.g. add defaults. More... | |
struct | seqan3::detail::search_dyn |
Object storing information for a search (of a search scheme). More... | |
struct | seqan3::detail::search_param |
Object grouping numbers of errors for different kind of error types. More... | |
class | seqan3::search_result< query_id_type, cursor_type, reference_id_type, reference_begin_position_type > |
The result class generated by the seqan3::seach algorithm. More... | |
class | seqan3::detail::search_scheme_algorithm< configuration_t, index_t, policies_t > |
The algorithm that performs a bidirectional search on a bidirectional FM index using (optimal) search schemes. More... | |
struct | seqan3::detail::search_traits< search_configuration_t > |
A collection of traits extracted from the search configuration. More... | |
class | seqan3::detail::unidirectional_search_algorithm< configuration_t, index_t, policies_t > |
The algorithm that performs a unidirectional search on an FM index using trivial backtracking. More... | |
Typedefs | |
using | seqan3::detail::search_scheme_dyn_type = std::vector< search_dyn > |
Type for storing search schemes. Number of blocks do not have to be known at compile time. | |
template<uint8_t nbr_searches, uint8_t nbr_blocks> | |
using | seqan3::detail::search_scheme_type = std::array< search< nbr_blocks >, nbr_searches > |
Type for storing search schemes. Number of blocks have to be known at compile time. | |
Enumerations | |
enum class | seqan3::detail::error_type : uint8_t { seqan3::detail::error_type::deletion , seqan3::detail::error_type::insertion , seqan3::detail::error_type::matchmm , seqan3::detail::error_type::none } |
An enumerator for the different error types used during the backtracking. More... | |
Functions | |
std::vector< search_dyn > | seqan3::detail::compute_ss (uint8_t const min_error, uint8_t const max_error) |
Computes a (non-optimal) search scheme. Currently the generated search scheme represents trivial backtracking. | |
template<typename index_t , std::ranges::forward_range queries_t, typename configuration_t = decltype(search_cfg::default_configuration)> requires std::ranges::forward_range<std::ranges::range_reference_t<queries_t>> && std::same_as<range_innermost_value_t<queries_t>, typename index_t::alphabet_type> | |
auto | seqan3::search (queries_t &&queries, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration) |
Search a query or a range of queries in an index. | |
template<bool abort_on_hit, typename query_t , typename delegate_t > requires (template_specialisation_of<typename index_t::cursor_type, bi_fm_index_cursor>) | |
void | seqan3::detail::search_scheme_algorithm< configuration_t, index_t, policies_t >::search_algo_bi (query_t &query, search_param const error_left, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index. | |
template<typename search_scheme_t > | |
auto | seqan3::detail::search_scheme_block_info (search_scheme_t const &search_scheme, size_t const query_length) |
Returns for each search the cumulative length of blocks in the order of blocks in each search and the starting position of the first block in the query sequence. | |
template<bool abort_on_hit, typename cursor_t , typename query_t , typename search_t , typename blocks_length_t , typename delegate_t > | |
bool | seqan3::detail::search_ss (cursor_t cur, query_t &query, typename cursor_t::size_type const lb, typename cursor_t::size_type const rb, uint8_t const errors_spent, uint8_t const block_id, bool const go_right, search_t const &search, blocks_length_t const &blocks_length, search_param const error_left, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index using a single search of a search schemes. | |
template<bool abort_on_hit, typename index_t , typename query_t , typename search_scheme_t , typename delegate_t > | |
void | seqan3::detail::search_ss (index_t const &index, query_t &query, search_param const error_left, search_scheme_t const &search_scheme, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index using search schemes. | |
template<bool abort_on_hit, typename cursor_t , typename query_t , typename search_t , typename blocks_length_t , typename delegate_t > | |
bool | seqan3::detail::search_ss_children (cursor_t cur, query_t &query, typename cursor_t::size_type const lb, typename cursor_t::size_type const rb, uint8_t const errors_spent, uint8_t const block_id, bool const go_right, uint8_t const min_error_left_in_block, search_t const &search, blocks_length_t const &blocks_length, search_param const error_left, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index using a single search of a search schemes. Sub-function for approximate search step (iterating over all children in a conceptual suffix tree). | |
template<bool abort_on_hit, typename cursor_t , typename query_t , typename search_t , typename blocks_length_t , typename delegate_t > | |
bool | seqan3::detail::search_ss_deletion (cursor_t cur, query_t &query, typename cursor_t::size_type const lb, typename cursor_t::size_type const rb, uint8_t const errors_spent, uint8_t const block_id, bool const go_right, search_t const &search, blocks_length_t const &blocks_length, search_param const error_left, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index using a single search of a search schemes. Sub-function for deletions at the end of a block. | |
template<bool abort_on_hit, typename cursor_t , typename query_t , typename search_t , typename blocks_length_t , typename delegate_t > | |
bool | seqan3::detail::search_ss_exact (cursor_t cur, query_t &query, typename cursor_t::size_type const lb, typename cursor_t::size_type const rb, uint8_t const errors_spent, uint8_t const block_id, bool const go_right, search_t const &search, blocks_length_t const &blocks_length, search_param const error_left, delegate_t &&delegate) |
Searches a query sequence in a bidirectional index using a single search of a search scheme. Sub-function for searching the remaining part of the current block without any errors. | |
template<bool abort_on_hit, typename query_t > | |
bool | seqan3::detail::unidirectional_search_algorithm< configuration_t, index_t, policies_t >::search_trivial (typename index_t::cursor_type cur, query_t &query, typename index_t::cursor_type::size_type const query_pos, search_param const error_left, error_type const prev_error) |
Searches a query sequence in an index using trivial backtracking. | |
Variables | |
template<uint8_t min_error, uint8_t max_error> | |
constexpr int | seqan3::detail::optimum_search_scheme {0} |
Search scheme that is optimal in the running time for the specified lower and upper error bound. | |
Data structures and approximate string search algorithms for large collection of text (e.g. DNA).
Searching is a key component in many sequence analysis tools. The search module is a powerful and easy way to search sequences in a large text or an arbitrary nested collection of texts. When it comes to searching, indices are a core component for searching large amounts of data and are used for tools such as read mappers, assemblers or protein search tools.
SeqAn currently implements only the FM index and a k-mer index is planned. The FM index works with arbitrary pattern lengths and error numbers.
Elaborate on that (space consumption for growing k, maybe a rule of thumb).
Add a description of the DREAM index here.
The Search module offers a unified search interface seqan3::search. The function chooses the best search method based on the provided index and an optional configuration.
The seqan3::search
algorithm can be configured in multiple ways. You simply pass a respective config
object to the algorithm:
The search configuration is independent of the index you use. It specifies for example how many mismatches and indels are allowed to consider your search successful.
See detailed documentation on Configuration for details.
The result is a range over "hits". We call a hit the position of a successful search. A hit in SeqAn is represented by the seqan3::search_result. You can configure what's part of the search result with the 4. Output Configuration.
You can iterate over the results with a normal for loop:
In order to choose the right Index for your use case, have a look at the next section (Available Indices for the seqan3::search). The documentation of the respective index will lead you to a more detailed description on how to use our search algorithm.
The seqan3::fm_index implements an original FM Index with a trivial backtracking approach. It is recommended for searches with no errors (exact searches). For exact searches, the original FM index is slightly faster than the bidirectional FM Index, and in general, it is only half the size.
The seqan3::bi_fm_index is a bidirectional FM index [1]. It improves the original FM index by allowing to extend the query to the right and the left. This makes the bidirectional FM index more efficient than its predecessor when searching with errors (approximate search). The performance gain is enabled by using (optimum) search schemes. Currently, using search schemes is only supported for searches with up to three errors and falls back to a trivial backtracking approach for higher errors. In the future, we plan to improve the search schemes to handle higher error counts.
Reference:
[1] Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index. bioRxiv, 301085. https://doi.org/10.1101/301085, Kianfar, K., Pockrandt, C., Torkamandi, B., Luo, H., & Reinert, K. (2018).
A search scheme is a collection of searches, where each search s
specifies the order in which the blocks are searched (s.pi
), the lower error bounds (s.l
) and the upper error bounds (s.u
). If the number of blocks that the query sequences are split into are known at compile time, the data structure seqan3::detail::search is recommended, otherwise one has to use seqan3::detail::search_dyn. The first one implements its member variables as an a std::array
of integers, the latter as a std::vector
of integers. Search search schemes are defined similiar. They are either implemented as a std::array
of searches if the number of searches is known at compile time, or as a std::vector
if not.
Precomputed optimum search schemes are represented as a std::array
of seqan3::detail::search since both the number of searches and the number of blocks are known at compile time. Search schemes computed at run time are represented as std::vector
of seqan3::detail::search_dyn.
All optimum search schemes are disjoint, i.e. no error configuration is covered by more than one search. Thus there is no need for a filtration phase after searching to remove duplicate hits when searching with hamming distance. Allowing insertions and deletions will, however, lead to redundant reports of hits regardless of search schemes.
If you are not interested in the exact match of your query, but simply whether your query can be found or not (also called membership query or lookup), SeqAn offers efficient data structures for this purpose. Of course you could also use a simple hash map (e.g., std::unordered_map
) but depending on your data this can quickly become infeasible in terms of memory consumption and runtime.
Take a look at the seqan3::interleaved_bloom_filter. The following example shows how to use it to simultaneously query all regions of a genome for the occurrence of a specific query:
|
strong |
An enumerator for the different error types used during the backtracking.
|
inline |
Computes a (non-optimal) search scheme. Currently the generated search scheme represents trivial backtracking.
[in] | min_error | Minimum number of errors allowed. |
[in] | max_error | Maximum number of errors allowed. |
Constant.
Strong exception guarantee.
|
inline |
Search a query or a range of queries in an index.
index_t | Type of the index. See Available Indices for the seqan3::search for an overview of our indices. |
queries_t | Must model std::ranges::random_access_range over the index's alphabet and std::ranges::sized_range. A range of queries must additionally model std::ranges::forward_range and std::ranges::sized_range. |
[in] | queries | A single query or a range of queries. |
[in] | index | String index to be searched. |
[in] | cfg | A configuration object specifying the search parameters (e.g. number of errors, error types, output format, etc.). |
void
if an on_hit delegate has been specified. Header File
#include <seqan3/search/search.hpp>
The search algorithm strongly depends on the index that is used. Please see Available Indices for the seqan3::search for an overview of our indices.
For more details on how to configure the search, please see the respective documentation: Configuration.
Each query with \(e\) errors takes \(O(|query|^e)\) where \(e\) is the maximum number of errors.
Strong exception guarantee if iterating the query does not change its state and if invoking a possible delegate specified in cfg
also has a strong exception guarantee; basic exception guarantee otherwise.
|
inlineprivate |
Searches a query sequence in a bidirectional index.
abort_on_hit | If the flag is set, the search aborts on the first hit. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
delegate_t | Takes typename index_t::cursor_type as argument. |
[in] | query | Query sequence to be searched in the index. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | delegate | Function that is called on every hit. |
\(O(|query|^e)\) where \(e\) is the total number of maximum errors.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
|
inline |
Returns for each search the cumulative length of blocks in the order of blocks in each search and the starting position of the first block in the query sequence.
search_scheme_t | Is of type seqan3::detail::search_scheme_type or seqan3::detail::search_scheme_dyn_type . |
[in] | search_scheme | Search scheme that will be used for searching. |
[in] | query_length | Length of the query that will be searched in an index. |
Constant.
Strong exception guarantee.
|
inline |
Searches a query sequence in a bidirectional index using a single search of a search schemes.
abort_on_hit | If the flag is set, the search aborts on the first hit. |
cursor_t | Must model seqan3::detail::template_specialisation_of a seqan3::bi_fm_index_cursor. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
search_t | Is of type seqan3::detail::search<> or seqan3::detail::search_dyn<> . |
blocks_length_t | Is of type std::array or std::vector of unsigned integers. |
delegate_t | Takes cursor_t as argument. |
[in] | cur | Cursor of a string index built on the text that will be searched. |
[in] | query | Query sequence to be searched. |
[in] | lb | Left bound of the infix of query already searched (exclusive). |
[in] | rb | Right bound of the infix of query already searched (exclusive). |
[in] | errors_spent | Number of errors spent while searching the infix of query . |
[in] | block_id | Id of the block that the infix is extended to next. |
[in] | go_right | The infix will be extended to the right if the flag is set to true. |
[in] | search | Search of a search scheme to be used for searching. |
[in] | blocks_length | Cumulative block lengths of the search. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | delegate | Function that is called on every hit. |
True
if and only if abort_on_hit
is true and a hit has been found.\(O(|query|^e)\) where \(e\) is the total number of errors allowed by search
.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
|
inline |
Searches a query sequence in a bidirectional index using search schemes.
abort_on_hit | If the flag is set, the search aborts on the first hit. |
index_t | index_t::cursor_type must model seqan3::detail::template_specialisation_of a seqan3::bi_fm_index_cursor. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
search_scheme_t | Is of type seqan3::detail::search_scheme_type or seqan3::detail::search_scheme_dyn_type . |
delegate_t | Takes typename index_t::cursor_type as argument. |
[in] | index | String index built on the text that will be searched. |
[in] | query | Query sequence to be searched in the index. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | search_scheme | Search scheme to be used for searching. |
[in] | delegate | Function that is called on every hit. |
\(O(|query|^e)\) where \(e\) is the total number of maximum errors.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
|
inline |
Searches a query sequence in a bidirectional index using a single search of a search schemes. Sub-function for approximate search step (iterating over all children in a conceptual suffix tree).
abort_on_hit | If the flag is set, the search aborts on the first hit. |
cursor_t | Must model seqan3::detail::template_specialisation_of a seqan3::bi_fm_index_cursor. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
search_t | Is of type seqan3::detail::search<> or seqan3::detail::search_dyn<> . |
blocks_length_t | Is of type std::array or std::vector of unsigned integers. |
delegate_t | Takes cursor_t as argument. |
[in] | cur | Cursor of a string index built on the text that will be searched. |
[in] | query | Query sequence to be searched. |
[in] | lb | Left bound of the infix of query already searched (exclusive). |
[in] | rb | Right bound of the infix of query already searched (exclusive). |
[in] | errors_spent | Number of errors spent while searching the infix of query . |
[in] | block_id | Id of the block that the infix is extended to next. |
[in] | go_right | The infix will be extended to the right if the flag is set to true. |
[in] | search | Search of a search scheme to be used for searching. |
[in] | blocks_length | Cumulative block lengths of the search. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | delegate | Function that is called on every hit. |
True
if and only if abort_on_hit
is true and a hit has been found.\(O(|query|^e)\) where \(e\) is the total number of errors allowed by search
.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
[in] | min_error_left_in_block | Number of remaining errors that need to be spent in the current block. |
|
inline |
Searches a query sequence in a bidirectional index using a single search of a search schemes. Sub-function for deletions at the end of a block.
abort_on_hit | If the flag is set, the search aborts on the first hit. |
cursor_t | Must model seqan3::detail::template_specialisation_of a seqan3::bi_fm_index_cursor. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
search_t | Is of type seqan3::detail::search<> or seqan3::detail::search_dyn<> . |
blocks_length_t | Is of type std::array or std::vector of unsigned integers. |
delegate_t | Takes cursor_t as argument. |
[in] | cur | Cursor of a string index built on the text that will be searched. |
[in] | query | Query sequence to be searched. |
[in] | lb | Left bound of the infix of query already searched (exclusive). |
[in] | rb | Right bound of the infix of query already searched (exclusive). |
[in] | errors_spent | Number of errors spent while searching the infix of query . |
[in] | block_id | Id of the block that the infix is extended to next. |
[in] | go_right | The infix will be extended to the right if the flag is set to true. |
[in] | search | Search of a search scheme to be used for searching. |
[in] | blocks_length | Cumulative block lengths of the search. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | delegate | Function that is called on every hit. |
True
if and only if abort_on_hit
is true and a hit has been found.\(O(|query|^e)\) where \(e\) is the total number of errors allowed by search
.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
|
inline |
Searches a query sequence in a bidirectional index using a single search of a search scheme. Sub-function for searching the remaining part of the current block without any errors.
abort_on_hit | If the flag is set, the search aborts on the first hit. |
cursor_t | Must model seqan3::detail::template_specialisation_of a seqan3::bi_fm_index_cursor. |
query_t | Must model std::ranges::random_access_range over the index's alphabet. |
search_t | Is of type seqan3::detail::search<> or seqan3::detail::search_dyn<> . |
blocks_length_t | Is of type std::array or std::vector of unsigned integers. |
delegate_t | Takes cursor_t as argument. |
[in] | cur | Cursor of a string index built on the text that will be searched. |
[in] | query | Query sequence to be searched. |
[in] | lb | Left bound of the infix of query already searched (exclusive). |
[in] | rb | Right bound of the infix of query already searched (exclusive). |
[in] | errors_spent | Number of errors spent while searching the infix of query . |
[in] | block_id | Id of the block that the infix is extended to next. |
[in] | go_right | The infix will be extended to the right if the flag is set to true. |
[in] | search | Search of a search scheme to be used for searching. |
[in] | blocks_length | Cumulative block lengths of the search. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | delegate | Function that is called on every hit. |
True
if and only if abort_on_hit
is true and a hit has been found.\(O(|query|^e)\) where \(e\) is the total number of errors allowed by search
.
Strong exception guarantee if iterating the query does not change its state and if invoking the delegate also has a strong exception guarantee; basic exception guarantee otherwise.
|
inlineprivate |
Searches a query sequence in an index using trivial backtracking.
abort_on_hit | If the flag is set, the search algorithm aborts on the first hit. |
query_t | Must model std::ranges::input_range over the index's alphabet. |
[in] | cur | Cursor of a string index built on the text that will be searched. |
[in] | query | Query sequence to be searched with the cursor. |
[in] | query_pos | Position in the query sequence indicating the prefix that has already been searched. |
[in] | error_left | Number of errors left for matching the remaining suffix of the query sequence. |
[in] | prev_error | Previous scenario of search, i.e. error or match. |
True
if and only if abort_on_hit
is true
and a hit has been found.\(O(|query|^e)\) where \(e\) is the maximum number of errors.
No-throw guarantee if invoking the delegate also guarantees no-throw.
|
inlineconstexpr |
Search scheme that is optimal in the running time for the specified lower and upper error bound.
min_error | Lower bound of errors. |
max_error | Upper bound of errors. |
Please note that the searches within each search scheme are sorted by their asymptotical run time (i.e. upper error bound string), s.t. easy to compute searches come first. This improves the run time of algorithms that abort after the first hit (e.g. search mode: best). Even though it is not guaranteed, this seems to be a good greedy approach.