Data structures and approximate string search algorithms for large collection of text (e.g. DNA). More...

Collaboration diagram for Search:

Modules
	Configuration
	Data structures and utility functions for configuring search algorithm.

	DREAM Index
	Provides seqan3::interleaved_bloom_filter.

	FM Index
	Provides seqan3::fm_index and seqan3::bi_fm_index as well as respective cursors.

	k-mer Index
	Implementation of shapes for a k-mer Index.

	Views
	Search related views.

Classes
class	seqan3::search_result< query_id_type, cursor_type, reference_id_type, reference_begin_position_type >
	The result class generated by the seqan3::seach algorithm. More...

struct	seqan3::search_result_printer< search_result< specs_t... > >
	The printer used for formatted output of the search result. More...

Functions
template<typename index_t , std::ranges::forward_range queries_t, typename configuration_t = decltype(search_cfg::default_configuration)> requires std::ranges::forward_range<std::ranges::range_reference_t<queries_t>> && std::same_as<range_innermost_value_t<queries_t>, typename index_t::alphabet_type>
auto	seqan3::search (queries_t &&queries, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration)
	Search a query or a range of queries in an index.

Detailed Descriptionno-api

Data structures and approximate string search algorithms for large collection of text (e.g. DNA).

Searching is a key component in many sequence analysis tools. The search module is a powerful and easy way to search sequences in a large text or an arbitrary nested collection of texts. When it comes to searching, indices are a core component for searching large amounts of data and are used for tools such as read mappers, assemblers or protein search tools.

SeqAn currently implements only the FM index and a k-mer index is planned. The FM index works with arbitrary pattern lengths and error numbers.

Search algorithm: seqan3::search

The Search module offers a unified search interface seqan3::search. The function chooses the best search method based on the provided index and an optional configuration.

seqan3::search(queries, index, config);

seqan3::search

auto search(queries_t &&queries, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration)

Search a query or a range of queries in an index.

Definition search.hpp:104

The search configuration

The seqan3::search algorithm can be configured in multiple ways. You simply pass a respective config object to the algorithm:

The search configuration is independent of the index you use. It specifies for example how many mismatches and indels are allowed to consider your search successful.

See detailed documentation on Configuration for details.

The search result

The result is a range over "hits". We call a hit the position of a successful search. A hit in SeqAn is represented by the seqan3::search_result. You can configure what's part of the search result with the 4. Output Configuration.

You can iterate over the results with a normal for loop:

for (auto && result : results)

seqan3::debug_stream << result << '\n';

seqan3

The main SeqAn3 namespace.

Definition aligned_sequence_concept.hpp:26

Which index do I use?

In order to choose the right Index for your use case, have a look at the next section (Available Indices for the seqan3::search). The documentation of the respective index will lead you to a more detailed description on how to use our search algorithm.

Available Indices for the seqan3::search

seqan3::fm_index

The seqan3::fm_index implements an original FM Index with a trivial backtracking approach. It is recommended for searches with no errors (exact searches). For exact searches, the original FM index is slightly faster than the bidirectional FM Index, and in general, it is only half the size.

seqan3::bi_fm_index

The seqan3::bi_fm_index is a bidirectional FM index [1]. It improves the original FM index by allowing to extend the query to the right and the left. This makes the bidirectional FM index more efficient than its predecessor when searching with errors (approximate search). The performance gain is enabled by using (optimum) search schemes. Currently, using search schemes is only supported for searches with up to three errors and falls back to a trivial backtracking approach for higher errors. In the future, we plan to improve the search schemes to handle higher error counts.

Reference:

[1] Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index. bioRxiv, 301085. https://doi.org/10.1101/301085, Kianfar, K., Pockrandt, C., Torkamandi, B., Luo, H., & Reinert, K. (2018).

Membership queries (lookups)

If you are not interested in the exact match of your query, but simply whether your query can be found or not (also called membership query or lookup), SeqAn offers efficient data structures for this purpose. Of course you could also use a simple hash map (e.g., std::unordered_map) but depending on your data this can quickly become infeasible in terms of memory consumption and runtime.

Take a look at the seqan3::interleaved_bloom_filter. The following example shows how to use it to simultaneously query all regions of a genome for the occurrence of a specific query:

// SPDX-FileCopyrightText: 2006-2025 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/dream_index/interleaved_bloom_filter.hpp>
#include <seqan3/search/views/kmer_hash.hpp>
#include <seqan3/utility/views/chunk.hpp>
 
using namespace seqan3::literals;
 
int main()
{
    auto genome = "TTTTTTTTTTAAAAAAAAAATTTTTTTTTTGGGGGGGGGG"_dna4;
 
    seqan3::interleaved_bloom_filter ibf{seqan3::bin_count{4u}, seqan3::bin_size{8192u}}; // reserve 4 buckets
 
    // Fill buckets of the interleaved bloomm filter
    auto genome_buckets = genome | seqan3::views::chunk(10); // divide genome into buckets of size 10
    size_t bucket_idx{0};
    for (auto bucket : genome_buckets)
    {
        for (auto kmer : bucket | seqan3::views::kmer_hash(seqan3::ungapped{2})) // hash genome with k = 2
            ibf.emplace(kmer, seqan3::bin_index{bucket_idx});
        ++bucket_idx;
    }
 
    auto ibf_agent = ibf.counting_agent(); // the membership_agent enables efficient kemr queries
 
    auto query = "TTT"_dna4;
    auto query_kmers = query | seqan3::views::kmer_hash(seqan3::ungapped{2}); // hash query with k = 2
 
    seqan3::debug_stream << ibf_agent.bulk_count(query_kmers) << '\n'; // prints [2,0,2,0]
}

Function Documentation

◆ search() [no-api]

template<typename index_t , std::ranges::forward_range queries_t, typename configuration_t = decltype(search_cfg::default_configuration)>
requires std::ranges::forward_range<std::ranges::range_reference_t<queries_t>> && std::same_as<range_innermost_value_t<queries_t>, typename index_t::alphabet_type>

auto seqan3::search	(	queries_t &&	queries,
		index_t const &	index,
		configuration_t const &	cfg = `search_cfg::default_configuration`
	)

no-apiinline

Search a query or a range of queries in an index.

Template Parameters

index_t	Type of the index. See Available Indices for the seqan3::search for an overview of our indices.
queries_t	Must model std::ranges::random_access_range over the index's alphabet and std::ranges::sized_range. A range of queries must additionally model std::ranges::forward_range and std::ranges::sized_range.

Parameters

[in]	queries	A single query or a range of queries.
[in]	index	String index to be searched.
[in]	cfg	A configuration object specifying the search parameters (e.g. number of errors, error types, output format, etc.).

Returns: A seqan3::algorithm_result_generator_range with value type of seqan3::search_result.

Header File

#include <seqan3/search/search.hpp>

The search algorithm strongly depends on the index that is used. Please see Available Indices for the seqan3::search for an overview of our indices.

For more details on how to configure the search, please see the respective documentation: Configuration.

Complexity

Each query with $e$ errors takes $O (| q u e r y |^{e})$ where $e$ is the maximum number of errors.

Exceptions

Strong exception guarantee if iterating the query does not change its state and if invoking a possible delegate specified in cfg also has a strong exception guarantee; basic exception guarantee otherwise.

Example

// SPDX-FileCopyrightText: 2006-2025 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2025 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <vector>
 
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/fm_index/all.hpp>
#include <seqan3/search/search.hpp>
 
int main()
{
    using namespace seqan3::literals;
    std::vector<seqan3::dna4_vector> genomes{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
                                             "ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
                                             "GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    std::vector<seqan3::dna4_vector> queries{"GCT"_dna4, "ACCC"_dna4};
 
    // build an FM index
    seqan3::fm_index index{genomes};
 
    auto results = seqan3::search(queries, index);
 
    // search for the queries "GCT" and "ACCC"
    for (auto && result : results)
        seqan3::debug_stream << result << '\n';
    // This should result in:
    // <query_id:0, reference_id:0, reference_pos:1>
    // <query_id:0, reference_id:1, reference_pos:9>
    // <query_id:0, reference_id:2, reference_pos:16>
    // <query_id:1, reference_id:1, reference_pos:0>
    // <query_id:1, reference_id:1, reference_pos:12>
    // <query_id:1, reference_id:2, reference_pos:9>
}

Modules

Classes

Functions