Introduction

Exact and approximate string matching is a common problem in bioinformatics, e.g. in read mapping. Usually, we want to search for a hit of one or many small sequences (queries) in a database consisting of one or more large sequences (references). Trivial approaches such as on-line search fail at this task for large data due to prohibitive run times.

The general solution is to use a data structure called an index. An index is built over the reference and only needs to be computed once for a given data set. It is used to speed up the search in the reference and can be re-used by storing it to disk and loading it again without recomputation.

There are two groups of indices: hash tables and suffix-based indices. SeqAn implements the FM-Index and Bi-FM-Index as suffix-based indices. While the Bi-FM-Index needs more space (a factor of about 2), it allows faster searches.

Given an index, SeqAn will choose the best search strategy for you. Since very different algorithms may be selected internally depending on the configuration, it is advisable to do benchmarks with your application. A rule of thumb is to use the Bi-FM-Index when allowing more than 2 errors.

This tutorial will show you how to use the seqan3::fm_index and seqan3::bi_fm_index to create indices and how to search them efficiently using seqan3::search.

Capabilities

With this module you can:

Create, store and load (Bi-)FM-Indices
Search for exact hits
Search for approximate hits (allowing substitutions and indels)

The results of the search can be passed on to other modules, e.g. to create an alignment.

Terminology

Reference

A reference is the data you want to search in, e.g. a genome or protein database.

Query

A query is the data you want to search for in the reference, e.g. a read or an amino acid sequence.

Index

An index is a data structure built over the reference that allows fast searches.

FM-Index

The FM-Index (Full-text index in Minute space) is an index that is similar to a suffix tree, but much smaller. It is used by most state-of-the-art read mappers and aligners. You can find more information on FM-Indices in the original publication and on Wikipedia.
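
To make the idea behind the FM-Index more tangible, here is a minimal sketch of backward search on a Burrows-Wheeler transform in plain C++. It is not SeqAn's implementation: the type toy_fm_index and its members are hypothetical names chosen for this illustration, the suffix array is built by naive sorting, and occurrence counts are obtained by scanning, so the sketch is only suitable for very small texts. It also assumes that the text does not contain the '\0' character, which serves as the sentinel.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <numeric>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical teaching type, not part of SeqAn.
class toy_fm_index
{
public:
    // Assumes that `text` contains no '\0', which is appended as the unique, smallest sentinel.
    explicit toy_fm_index(std::string text)
    {
        text.push_back('\0');
        std::size_t const n = text.size();

        // Suffix array via plain sorting of all suffixes (slow, for demonstration only).
        sa_.resize(n);
        std::iota(sa_.begin(), sa_.end(), std::size_t{0});
        std::sort(sa_.begin(), sa_.end(),
                  [&text](std::size_t a, std::size_t b)
                  { return text.compare(a, std::string::npos, text, b, std::string::npos) < 0; });

        // Burrows-Wheeler transform: the character that precedes each sorted suffix.
        bwt_.resize(n);
        for (std::size_t i = 0; i < n; ++i)
            bwt_[i] = text[(sa_[i] + n - 1) % n];

        // less_[c] = number of characters in the text that are smaller than c.
        std::array<std::size_t, 256> counts{};
        for (char ch : text)
            ++counts[static_cast<unsigned char>(ch)];
        for (std::size_t c = 1; c < less_.size(); ++c)
            less_[c] = less_[c - 1] + counts[c - 1];
    }

    // Returns the sorted start positions of all exact occurrences of `query`.
    std::vector<std::size_t> locate(std::string_view query) const
    {
        if (query.empty())
            return {};

        // [lo, hi) is the suffix array interval of the query suffix processed so far.
        std::size_t lo = 0;
        std::size_t hi = bwt_.size();
        for (auto it = query.rbegin(); it != query.rend(); ++it)
        {
            std::size_t const rank = less_[static_cast<unsigned char>(*it)];
            lo = rank + occ(*it, lo);
            hi = rank + occ(*it, hi);
            if (lo >= hi)
                return {}; // the interval became empty: no occurrence
        }

        std::vector<std::size_t> positions(sa_.begin() + static_cast<std::ptrdiff_t>(lo),
                                           sa_.begin() + static_cast<std::ptrdiff_t>(hi));
        std::sort(positions.begin(), positions.end());
        return positions;
    }

private:
    // Number of occurrences of `ch` in bwt_[0, end), computed by a linear scan.
    std::size_t occ(char ch, std::size_t end) const
    {
        return static_cast<std::size_t>(
            std::count(bwt_.begin(), bwt_.begin() + static_cast<std::ptrdiff_t>(end), ch));
    }

    std::vector<std::size_t> sa_;
    std::string bwt_;
    std::array<std::size_t, 256> less_{};
};
```

The query is processed from its last character to its first, and each step narrows an interval of the suffix array. The number of steps therefore depends on the query length rather than on the size of the text, which is what makes this kind of index attractive. Real implementations replace the linear scan in occ with compact rank data structures.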

Bi-FM-Index

The bidirectional FM-Index (Bi-FM-Index) is an extension of the FM-Index that enables faster searches, especially when allowing multiple errors, but it uses almost twice as much memory as the FM-Index. You can find more information on Bi-FM-Indices in the Bi-FM-Index paper listed under recommended reading.

Example

Constructing a (Bi-)FM-Index is very simple:

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/search/fm_index/bi_fm_index.hpp> // for using the bi_fm_index
#include <seqan3/search/fm_index/fm_index.hpp>    // for using the fm_index
 
int main()
{
    std::string text{"Garfield the fat cat without a hat."};
    seqan3::fm_index index{text};       // unidirectional index on single text
    seqan3::bi_fm_index bi_index{text}; // bidirectional index on single text
}

You can also index text collections (e.g. genomes with multiple chromosomes or protein databases):

        std::vector<std::string> text{{"Garfield the fat cat without a hat."},
                                      {"He is infinite, he is eternal."},
                                      {"Yet another text I have to think about."}};
        seqan3::fm_index index{text};
        seqan3::bi_fm_index bi_index{text};

The indices can also be stored to and loaded from disk by using cereal.

#include <fstream> // for writing/reading files
 
#include <cereal/archives/binary.hpp> // for storing/loading indices via cereal
 
        std::string text{"Garfield the fat cat without a hat."};
        seqan3::fm_index index{text};
        {
            std::ofstream os{"index.file", std::ios::binary};
            cereal::BinaryOutputArchive oarchive{os};
            oarchive(index);
        }

        // we need to tell the index that we work on a single text and a `char` alphabet before loading
        seqan3::fm_index<char, seqan3::text_layout::single> index;
        {
            std::ifstream is{"index.file", std::ios::binary};
            cereal::BinaryInputArchive iarchive{is};
            iarchive(index);
        }

Note that, in contrast to construction from a given text, the compiler cannot deduce the template arguments when the default constructor is used, so you have to provide them explicitly.

Assignment 1

You are given the text

seqan3::dna4_vector text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};

Create a seqan3::fm_index over the reference, store the index and load the index into a new seqan3::fm_index object. Print whether the indices are identical or differ.

Solution

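#include <fstream>  // for writing/reading files
#include <iostream> // for std::cout
 
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/search/fm_index/fm_index.hpp>
 
#include <cereal/archives/binary.hpp>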
 
int main()
{
    using namespace seqan3::literals;
 
    seqan3::dna4_vector text{
        "CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    seqan3::fm_index index{text};
 
    {
        std::ofstream os{"index.file", std::ios::binary};
        cereal::BinaryOutputArchive oarchive{os};
        oarchive(index);
    }
 
    // we need to tell the index that we work on a single text and a `dna4` alphabet before loading
    seqan3::fm_index<seqan3::dna4, seqan3::text_layout::single> index2;
    {
        std::ifstream is{"index.file", std::ios::binary};
        cereal::BinaryInputArchive iarchive{is};
        iarchive(index2);
    }
 
    if (index == index2)
        std::cout << "The indices are identical!\n";
    else
        std::cout << "The indices differ!\n";
}

Expected output:

The indices are identical!

Search

Using an index, we can now conduct searches for a given query. In this part, we will learn how to search exactly, allow substitutions and indels, and how to configure what kind of results we want, e.g. all results vs. only the best result.

Terminology

Exact search: Finds all locations of the query in the reference without any errors.

Approximate search: Finds all locations of the query in the reference with substitutions and indels, up to the configured maximum number of allowed errors.

Hit: A single result that identifies a specific location in the reference where a particular query can be found by either an exact or an approximate search.

Searching for exact hits

We can search for all exact hits using seqan3::search:

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/core/debug_stream.hpp> // pretty printing
#include <seqan3/search/fm_index/fm_index.hpp>
#include <seqan3/search/search.hpp>
 
using namespace std::string_literals; // for using the ""s string literal
 
int main()
{
    std::string text{"Garfield the fat cat without a hat."};
    seqan3::fm_index index{text};
    seqan3::debug_stream << search("cat"s, index) << '\n'; // [<query_id:0, reference_id:0, reference_pos:17>]
}

You can also pass multiple queries at the same time:

        std::string text{"Garfield the fat cat without a hat."};
        seqan3::fm_index index{text};
        std::vector<std::string> query{"cat"s, "hat"s};
        seqan3::debug_stream << search(query, index) << '\n';
        // prints: [<query_id:0, reference_id:0, reference_pos:17>,
        //          <query_id:1, reference_id:0, reference_pos:31>]

The returned result is a lazy range over individual results, where each entry represents a specific location within the reference sequence for a particular query.
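
The meaning of the three fields in each result can be illustrated with a small model in plain C++. This is only a sketch under simplifying assumptions: toy_hit and toy_search are hypothetical names, the search is a plain scan with std::string::find instead of an index lookup, and the results are collected eagerly in a vector rather than produced lazily.

```cpp
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for one search result: which query was found in which reference, and where.
struct toy_hit
{
    std::size_t query_id;
    std::size_t reference_id;
    std::size_t reference_pos;
};

inline bool operator==(toy_hit const & lhs, toy_hit const & rhs)
{
    return std::tie(lhs.query_id, lhs.reference_id, lhs.reference_pos)
        == std::tie(rhs.query_id, rhs.reference_id, rhs.reference_pos);
}

// Reports every exact occurrence of every query in every reference.
std::vector<toy_hit> toy_search(std::vector<std::string> const & queries,
                                std::vector<std::string> const & references)
{
    std::vector<toy_hit> hits;
    for (std::size_t query_id = 0; query_id < queries.size(); ++query_id)
    {
        std::string const & query = queries[query_id];
        if (query.empty())
            continue; // an empty query has no meaningful position

        for (std::size_t reference_id = 0; reference_id < references.size(); ++reference_id)
        {
            std::string const & reference = references[reference_id];
            // Continue one position after each match so that overlapping occurrences are found too.
            for (std::size_t pos = reference.find(query); pos != std::string::npos;
                 pos = reference.find(query, pos + 1))
            {
                hits.push_back(toy_hit{query_id, reference_id, pos});
            }
        }
    }
    return hits;
}
```

The query id is the position of the query in the list of queries, the reference id is the position of the text within the collection (always 0 for a single text), and the reference position is the offset of the hit within that text.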

Assignment 2

Search for all exact occurrences of GCT in the text from assignment 1.
Print the number of hits and their positions within the reference sequence.
Do the same for the following text collection:

std::vector<seqan3::dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
                              "ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
                              "GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};

Solution

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/fm_index/fm_index.hpp>
#include <seqan3/search/search.hpp>
 
using namespace seqan3::literals;
 
void run_text_single()
{
    seqan3::dna4_vector text{
        "CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    seqan3::fm_index index{text};
 
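    // The search result is single pass, so one search is used for counting and a second one for printing.
    auto counted_results = search("GCT"_dna4, index);
 
    seqan3::debug_stream << "=====   Running on a single text   =====\n"
                         << "There are " << std::ranges::distance(counted_results) << " hits.\n"
                         << "The following hits were found:\n";
 
    for (auto && result : search("GCT"_dna4, index))
        seqan3::debug_stream << result << '\n';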
}
 
void run_text_collection()
{
    std::vector<seqan3::dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
                                          "ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
                                          "GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    seqan3::fm_index index{text};
 
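    // The search result is single pass, so one search is used for counting and a second one for printing.
    auto counted_results = search("GCT"_dna4, index);
 
    seqan3::debug_stream << "===== Running on a text collection =====\n"
                         << "There are " << std::ranges::distance(counted_results) << " hits.\n"
                         << "The following hits were found:\n";
 
    for (auto && result : search("GCT"_dna4, index))
        seqan3::debug_stream << result << '\n';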
}
 
int main()
{
    run_text_single();
    seqan3::debug_stream << '\n';
    run_text_collection();
}

Expected output:

=====   Running on a single text   =====
There are 3 hits.
The following hits were found:
<query_id:0, reference_id:0, reference_pos:1>
<query_id:0, reference_id:0, reference_pos:41>
<query_id:0, reference_id:0, reference_pos:77>
 
===== Running on a text collection =====
There are 3 hits.
The following hits were found:
<query_id:0, reference_id:0, reference_pos:1>
<query_id:0, reference_id:1, reference_pos:9>
<query_id:0, reference_id:2, reference_pos:16>

Searching for approximate hits

Up until now, we have seen that we can call the search with the query sequences and the index. In addition, we can provide a third parameter to provide a user defined search configuration. If we do not provide a user defined search configuration, the seqan3::search_cfg::default_configuration will be used, which triggers an exact search, finding all hits for a particular query. In the following, we will see how we can change the behaviour of the search algorithm by providing a user defined search configuration.

Max error configuration

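You can specify the error configuration for the approximate search using the seqan3::search_cfg::max_error_* configuration. Here you can use a combination of the following configuration elements to specify exactly how the errors can be distributed during the search:

seqan3::search_cfg::max_error_total: Upper limit for the total number of errors
seqan3::search_cfg::max_error_substitution: Upper limit for the number of substitutions
seqan3::search_cfg::max_error_insertion: Upper limit for the number of insertions
seqan3::search_cfg::max_error_deletion: Upper limit for the number of deletions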

Each of the configuration elements can be constructed with either an absolute number of errors or an error rate depending on the context. These are represented by the following types:

seqan3::search_cfg::error_count: Absolute number of errors
seqan3::search_cfg::error_rate: Rate of errors \(\in[0,1]\)

By combining the different error types using the |-operator, we give you full control over the error distribution. Thus, it is possible to set an upper limit of allowed errors but also to refine the error distribution by specifying the allowed errors for substitutions, insertions, or deletions.

Note: When using >= 2 errors it is advisable to use a Bi-FM-Index since searches will be faster.

For example, to search for either 1 insertion or 1 deletion you can write:

        std::string text{"Garfield the fat cat without a hat."};
        seqan3::fm_index index{text};
        seqan3::configuration const cfg = seqan3::search_cfg::max_error_total{seqan3::search_cfg::error_count{1}}
                                        | seqan3::search_cfg::max_error_substitution{seqan3::search_cfg::error_count{0}}
                                        | seqan3::search_cfg::max_error_insertion{seqan3::search_cfg::error_count{1}}
                                        | seqan3::search_cfg::max_error_deletion{seqan3::search_cfg::error_count{1}};
        seqan3::debug_stream << search("cat"s, index, cfg) << '\n';
        // prints: [<query_id:0, reference_id:0, reference_pos:14>,
        //          <query_id:0, reference_id:0, reference_pos:17>,
        //          <query_id:0, reference_id:0, reference_pos:18>,
        //          <query_id:0, reference_id:0, reference_pos:32>]

Here, we restrict the approximate search to only allow one error. This can be then either an insertion or a deletion but not both, since that would exceed the total error limit. This basically means that the error counts/rates do not have to sum up to the total of errors allowed:

        seqan3::configuration const cfg = seqan3::search_cfg::max_error_total{seqan3::search_cfg::error_count{2}}
                                        | seqan3::search_cfg::max_error_substitution{seqan3::search_cfg::error_count{2}}
                                        | seqan3::search_cfg::max_error_insertion{seqan3::search_cfg::error_count{1}}
                                        | seqan3::search_cfg::max_error_deletion{seqan3::search_cfg::error_count{1}};

In the above example, we allow 2 errors, which can be any combination of 2 substitutions, 1 insertion and 1 deletion. Defining only the total will set all error types to this value, i.e. if the total error is set to an error count of 2, any combination of 2 substitutions, 2 insertions and 2 deletions is allowed. On the other hand, when defining any of the error types but no total, the total will be set to the sum of all error types. For example, if we would not specify a total error of 1 in the first example above, the total error would be set to 2 automatically. Hence, the search will also include approximate hits containing one insertion and one deletion.
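
The interplay between the total and the per-type limits can be written down as a small model in plain C++. This is a sketch of the rules described above, not SeqAn code: error_counts, error_limits, resolve_limits and is_allowed are hypothetical names, and the sketch assumes that an error type that is not specified defaults to 0 as soon as any other error type is specified.

```cpp
#include <optional>

// Number of errors of each type in one candidate hit (hypothetical helper type).
struct error_counts
{
    unsigned substitutions;
    unsigned insertions;
    unsigned deletions;
};

// Fully resolved limits after applying the defaulting rules.
struct error_limits
{
    unsigned total;
    unsigned substitution;
    unsigned insertion;
    unsigned deletion;
};

// Applies the rules from the text:
//  * only a total given       -> every error type may use up the whole total
//  * error types but no total -> the total is the sum of the given error types
// Assumption: an unspecified error type counts as 0 once any error type is specified.
error_limits resolve_limits(std::optional<unsigned> total,
                            std::optional<unsigned> substitution,
                            std::optional<unsigned> insertion,
                            std::optional<unsigned> deletion)
{
    bool const any_type_given = substitution.has_value() || insertion.has_value() || deletion.has_value();
    if (total.has_value() && !any_type_given)
        return error_limits{*total, *total, *total, *total};

    unsigned const sub = substitution.value_or(0u);
    unsigned const ins = insertion.value_or(0u);
    unsigned const del = deletion.value_or(0u);
    return error_limits{total.value_or(sub + ins + del), sub, ins, del};
}

// A hit is allowed if it respects every per-type limit and the total limit.
bool is_allowed(error_counts const & errors, error_limits const & limits)
{
    unsigned const sum = errors.substitutions + errors.insertions + errors.deletions;
    return errors.substitutions <= limits.substitution && errors.insertions <= limits.insertion
        && errors.deletions <= limits.deletion && sum <= limits.total;
}
```

With the limits of the first example (total 1, no substitutions, 1 insertion, 1 deletion), a hit with one insertion passes is_allowed, while a hit with one insertion and one deletion is rejected because it exceeds the total. Leaving out the total in that example resolves it to 2, and the same hit is accepted.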

Assignment 3

Search for all occurrences of GCT in the text from assignment 1.

Allow up to 1 substitution and print all occurrences.

Solution

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/fm_index/fm_index.hpp>
#include <seqan3/search/search.hpp>
 
int main()
{
    using namespace seqan3::literals;
 
    seqan3::dna4_vector text{
        "CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    seqan3::fm_index index{text};
 
    seqan3::configuration const cfg = seqan3::search_cfg::max_error_substitution{seqan3::search_cfg::error_count{1}};
 
    seqan3::debug_stream << search("GCT"_dna4, index, cfg) << '\n';
}

Expected output:

[<query_id:0, reference_id:0, reference_pos:1>,
 <query_id:0, reference_id:0, reference_pos:5>,
 <query_id:0, reference_id:0, reference_pos:12>,
 <query_id:0, reference_id:0, reference_pos:23>,
 <query_id:0, reference_id:0, reference_pos:36>,
 <query_id:0, reference_id:0, reference_pos:41>,
 <query_id:0, reference_id:0, reference_pos:57>,
 <query_id:0, reference_id:0, reference_pos:62>,
 <query_id:0, reference_id:0, reference_pos:75>,
 <query_id:0, reference_id:0, reference_pos:77>,
 <query_id:0, reference_id:0, reference_pos:83>,
 <query_id:0, reference_id:0, reference_pos:85>]

Which hits are reported?

Besides the max error configuration, you can specify the scope of the search algorithm. This means that you can control which hits should be reported by the search.

Hit configuration

To do so, you can use one of the following seqan3::search_cfg::hit_* configurations:

seqan3::search_cfg::hit_all: Report all hits that satisfy the (approximate) search.
seqan3::search_cfg::hit_single_best: Report the best hit, i.e. the first hit with the lowest edit distance.
seqan3::search_cfg::hit_all_best: Report all hits with the lowest edit distance.
seqan3::search_cfg::hit_strata: best+x strategy. Report all hits within the x-neighbourhood of the best hit.

In contrast to the max error configuration, which allows a combination of the different error configuration objects, the hit configuration can only exist once within one search configuration. Trying to specify more than one hit configuration in one search configuration will fail at compile time with a static assertion.

Sometimes a program needs to choose between different hit configurations at runtime, e.g. depending on a user-given program argument. To handle such cases, you can use the dynamic configuration seqan3::search_cfg::hit. This configuration object represents one of the four hit configurations mentioned previously and can be modified at runtime. The following snippet gives an example for this scenario:

    seqan3::search_cfg::hit hit_dynamic{seqan3::search_cfg::hit_all{}}; // Initialise with hit_all configuration.
 
    bool hit_with_strata = static_cast<bool>(std::rand() & 1); // Either false or true.
    if (hit_with_strata)
        hit_dynamic = seqan3::search_cfg::hit_strata{2}; // Search instead with strata mode.
 
    seqan3::configuration const cfg = seqan3::search_cfg::max_error_total{seqan3::search_cfg::error_count{2}}
                                    | seqan3::search_cfg::max_error_substitution{seqan3::search_cfg::error_count{0}}
                                    | seqan3::search_cfg::max_error_insertion{seqan3::search_cfg::error_count{1}}
                                    | seqan3::search_cfg::max_error_deletion{seqan3::search_cfg::error_count{1}}
                                    | hit_dynamic; // Build the configuration by adding the dynamic hit configuration.

Note that the same rules apply to both the dynamic and static hit configuration. That is, it can be added via the |-operator to the search configuration but cannot be combined with any other hit configuration.

A closer look at the strata configuration reveals that it is initialised with an additional parameter called the stratum. The stratum can be modified even after it was added to the search configuration like the following example demonstrates:

        seqan3::configuration cfg = seqan3::search_cfg::max_error_total{seqan3::search_cfg::error_count{2}}
                                  | seqan3::search_cfg::max_error_substitution{seqan3::search_cfg::error_count{0}}
                                  | seqan3::search_cfg::max_error_insertion{seqan3::search_cfg::error_count{1}}
                                  | seqan3::search_cfg::max_error_deletion{seqan3::search_cfg::error_count{1}}
                                  | seqan3::search_cfg::hit_strata{2};
        using seqan3::get;                                    // Required to be able to find the correct get function.
        get<seqan3::search_cfg::hit_strata>(cfg).stratum = 1; // The stratum is now 1 and not 2 anymore.

Here we introduced a new concept when working with the seqan3::configuration object, which is much like the access interface of a std::tuple. Concretely, it is possible to access the stored configuration using the get<cfg_type>(cfg) interface, where cfg_type is the name of the configuration type we would like to access. The get interface returns a reference to the stored object that is identified by the given name. If you try to access an object which does not exist within the search configuration, a static assert will be emitted at compile time such that no invalid code can be generated.

Note: We need to use the expression using seqan3::get; before we can call the get interface in order to allow the compiler to find the correct implementation of it based on the passed argument. This is related to how C++ resolves unqualified lookup of free functions in combination with function templates using an explicit template argument such as the get interface does.

So, the open question remains what the stratum actually does. In the above example, if the best hit found by the search for a particular query had an edit distance of 1, the strata strategy would report all hits with up to an edit distance of 2. Since in this example the total error number is set to 2, all hits with 1 or 2 errors would be reported.
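
The four hit strategies can be summarised as filters over a list of candidate hits and their error counts. The following plain C++ sketch models only the selection rule: scored_hit, hit_mode and select_hits are hypothetical names, the candidates are assumed to already satisfy the max error configuration, and the sketch says nothing about how SeqAn avoids enumerating hits it does not need to report.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// A candidate hit together with its number of errors (hypothetical helper type).
struct scored_hit
{
    std::size_t reference_pos;
    unsigned errors;
};

enum class hit_mode
{
    all,
    single_best,
    all_best,
    strata
};

// Selects the hits to report. `stratum` is only used for hit_mode::strata.
std::vector<scored_hit> select_hits(std::vector<scored_hit> const & candidates, hit_mode mode, unsigned stratum = 0)
{
    if (candidates.empty() || mode == hit_mode::all)
        return candidates;

    // std::min_element returns the first hit with the lowest number of errors.
    auto const best = std::min_element(candidates.begin(), candidates.end(),
                                       [](scored_hit const & lhs, scored_hit const & rhs)
                                       { return lhs.errors < rhs.errors; });
    if (mode == hit_mode::single_best)
        return std::vector<scored_hit>{*best};

    // all_best keeps the best stratum only, strata widens it by `stratum` additional errors.
    unsigned const threshold = best->errors + (mode == hit_mode::strata ? stratum : 0u);
    std::vector<scored_hit> selected;
    std::copy_if(candidates.begin(), candidates.end(), std::back_inserter(selected),
                 [threshold](scored_hit const & hit) { return hit.errors <= threshold; });
    return selected;
}
```

If the best candidate has 1 error and the stratum is 1, all candidates with 1 or 2 errors are selected, which mirrors the example described above. With a stratum of 0, the strata mode selects the same hits as the all_best mode.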

Assignment 4

Search for all occurrences of GCT in the text from assignment 1.
Allow up to 1 error of any type and print the number of hits for each hit strategy (use seqan3::search_cfg::hit_strata{1}).

Hint

You can use std::ranges::distance to get the size of any range. Depending on the underlying range properties, this algorithm will use the optimal way to compute the number of elements contained in the range.

Solution

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/fm_index/fm_index.hpp>
#include <seqan3/search/search.hpp>
 
int main()
{
    using namespace seqan3::literals;
 
    seqan3::dna4_vector text{
        "CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    seqan3::dna4_vector query{"GCT"_dna4};
 
    seqan3::fm_index index{text};
 
    seqan3::debug_stream << "Searching all hits\n";
    seqan3::configuration cfg = seqan3::search_cfg::max_error_total{seqan3::search_cfg::error_count{1}}
                              | seqan3::search_cfg::hit{seqan3::search_cfg::hit_all{}};
    auto results_all = search(query, index, cfg);
    // Attention: results_all is a pure std::ranges::input_range,
    //            so after calling std::ranges::distance, you cannot iterate over it again!
    seqan3::debug_stream << "There are " << std::ranges::distance(results_all) << " hits.\n";
 
    seqan3::debug_stream << "Searching all best hits\n";
    using seqan3::get;
    get<seqan3::search_cfg::hit>(cfg).hit_variant = seqan3::search_cfg::hit_all_best{};
    auto results_all_best = search(query, index, cfg);
    // Attention: results_all_best is a pure std::ranges::input_range,
    //            so after calling std::ranges::distance, you cannot iterate over it again!
    seqan3::debug_stream << "There are " << std::ranges::distance(results_all_best) << " hits.\n";
 
    seqan3::debug_stream << "Searching best hit\n";
    get<seqan3::search_cfg::hit>(cfg).hit_variant = seqan3::search_cfg::hit_single_best{};
    auto results_best = search(query, index, cfg);
    // Attention: results_best is a pure std::ranges::input_range,
    //            so after calling std::ranges::distance, you cannot iterate over it again!
    seqan3::debug_stream << "There is " << std::ranges::distance(results_best) << " hit.\n";
 
    seqan3::debug_stream << "Searching all hits in the 1-stratum\n";
    get<seqan3::search_cfg::hit>(cfg).hit_variant = seqan3::search_cfg::hit_strata{1};
    auto results_strata = search(query, index, cfg);
    // Attention: results_strata is a pure std::ranges::input_range,
    //            so after calling std::ranges::distance, you cannot iterate over it again!
    seqan3::debug_stream << "There are " << std::ranges::distance(results_strata) << " hits.\n";
}

Expected output:

Searching all hits
There are 25 hits.
Searching all best hits
There are 3 hits.
Searching best hit
There is 1 hit.
Searching all hits in the 1-stratum
There are 25 hits.

Controlling the search output

When calling the search algorithm, a lazy range over seqan3::search_result objects is returned. Each result object represents a single hit. Merely calling seqan3::search therefore does nothing except configure the search algorithm based on the given search configuration, query and index. The actual search for each query is triggered only when iterating over the lazy result range. In the previous examples this happened implicitly, either when printing the result to the seqan3::debug_stream, which iterates over the returned range, or when calling the std::ranges::distance algorithm.

However, in many cases we want to access the specific positions and information stored in the seqan3::search_result object to proceed with our application. Since some information is more expensive to compute than other information, you can control what the final search result object will contain.
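
The consequences of a lazy, single-pass result can be demonstrated with a small stand-alone model in plain C++. This is an illustration of the concept only: lazy_occurrences and count_remaining are hypothetical names, and the class performs a plain scan with std::string::find instead of an index search.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Hypothetical lazy, single-pass sequence of exact occurrences of `query` in `text`.
class lazy_occurrences
{
public:
    lazy_occurrences(std::string text, std::string query) : text_{std::move(text)}, query_{std::move(query)}
    {} // Construction only stores the input, no search is performed yet.

    // Searches for the next occurrence on demand. Returns std::nullopt once the text is exhausted.
    std::optional<std::size_t> next()
    {
        if (query_.empty() || next_start_ == std::string::npos)
            return std::nullopt;

        ++steps_;
        std::size_t const pos = text_.find(query_, next_start_);
        if (pos == std::string::npos)
        {
            next_start_ = std::string::npos; // remember that we are done
            return std::nullopt;
        }

        next_start_ = pos + 1;
        return pos;
    }

    // Number of find operations performed so far.
    std::size_t steps() const
    {
        return steps_;
    }

private:
    std::string text_;
    std::string query_;
    std::size_t next_start_{0};
    std::size_t steps_{0};
};

// Counting the remaining hits consumes them: a second call returns 0.
std::size_t count_remaining(lazy_occurrences & hits)
{
    std::size_t count = 0;
    while (hits.next().has_value())
        ++count;
    return count;
}
```

Directly after construction, steps() is 0 because nothing has been searched yet. Calling count_remaining performs the search and exhausts the sequence, so a second call returns 0. The same reasoning explains why a search result cannot be iterated again after it was passed to std::ranges::distance.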

Output configuration

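The behaviour of the search algorithm is further controlled through the seqan3::search_cfg::output_* configurations. The following output configurations exist:

seqan3::search_cfg::output_query_id: Report the id of the query
seqan3::search_cfg::output_reference_id: Report the id of the reference in which the hit was found
seqan3::search_cfg::output_reference_begin_position: Report the begin position of the hit within the reference
seqan3::search_cfg::output_index_cursor: Report the index cursor that points to the hit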

Similarly to the max error configurations, you can arbitrarily combine the configurations to customise the final output. For example, if you are only interested in the position of the hit within the reference sequence, you can use the seqan3::search_cfg::output_reference_begin_position configuration. Instead, if you need access to the index where the hit was found, you can use the seqan3::search_cfg::output_index_cursor configuration.

Note: If you do not provide any output configuration, then the query id and reference id as well as the reference begin position will be automatically reported. If you select only one in your search configuration, then only this one will be available in the final search result.

One last exercise

In the final example, we will extend our previous search examples to also compute the alignment of the found hits and their respective reference infixes. To do so, we recommend working through the Pairwise Alignment tutorial first.

Assignment 5

Search for all occurrences of GCT in the text from assignment 1.
Allow up to 1 error of any type and search for all occurrences with the strategy seqan3::search_cfg::hit_all_best.
Align the query to each of the found positions in the genome and print the score and alignment.
BONUS
Do the same for the text collection from assignment 2.

Hint

The search will give you positions in the text. To access the corresponding subrange of the text you can use std::span:

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <span>
 
#include <seqan3/core/debug_stream.hpp>
 
int main()
{
    std::string text{"Garfield the fat cat without a hat."};
    size_t start{2};
    size_t span{3};
 
    std::span text_view{std::data(text) + start, span}; // represent interval [2, 4]
 
    seqan3::debug_stream << text_view << '\n'; // Prints "rfi"
}

Solution

// SPDX-FileCopyrightText: 2006-2024 Knut Reinert & Freie Universität Berlin
// SPDX-FileCopyrightText: 2016-2024 Knut Reinert & MPI für molekulare Genetik
// SPDX-License-Identifier: CC0-1.0
 
#include <span>
 
#include <seqan3/alignment/configuration/all.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/alphabet/nucleotide/dna4.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/fm_index/fm_index.hpp>
#include <seqan3/search/search.hpp>
 
using namespace seqan3::literals;
 
// Define the pairwise alignment configuration globally.
inline constexpr auto align_config =
    seqan3::align_cfg::method_global{seqan3::align_cfg::free_end_gaps_sequence1_leading{true},
                                     seqan3::align_cfg::free_end_gaps_sequence2_leading{false},
                                     seqan3::align_cfg::free_end_gaps_sequence1_trailing{true},
                                     seqan3::align_cfg::free_end_gaps_sequence2_trailing{false}}
    | seqan3::align_cfg::edit_scheme | seqan3::align_cfg::output_alignment{} | seqan3::align_cfg::output_score{};
 
void run_text_single()
{
    seqan3::dna4_vector text{
        "CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    seqan3::dna4_vector query{"GCT"_dna4};
    seqan3::fm_index index{text};
 
    seqan3::debug_stream << "Searching all best hits allowing for 1 error in a single text\n";
 
    seqan3::configuration const search_config =
        seqan3::search_cfg::max_error_total{seqan3::search_cfg::error_count{1}} | seqan3::search_cfg::hit_all_best{};
 
    auto search_results = search(query, index, search_config);
 
    seqan3::debug_stream << "-----------------\n";
 
    for (auto && hit : search_results)
    {
        size_t start = hit.reference_begin_position() ? hit.reference_begin_position() - 1 : 0;
        std::span text_view{std::data(text) + start, query.size() + 1};
 
        for (auto && res : align_pairwise(std::tie(text_view, query), align_config))
        {
            auto && [aligned_database, aligned_query] = res.alignment();
            seqan3::debug_stream << "score:    " << res.score() << '\n';
            seqan3::debug_stream << "database: " << aligned_database << '\n';
            seqan3::debug_stream << "query:    " << aligned_query << '\n';
            seqan3::debug_stream << "=============\n";
        }
    }
}
 
void run_text_collection()
{
    std::vector<seqan3::dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
                                          "ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
                                          "GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    seqan3::dna4_vector query{"GCT"_dna4};
    seqan3::fm_index index{text};
 
    seqan3::debug_stream << "Searching all best hits allowing for 1 error in a text collection\n";
 
    seqan3::configuration const search_config =
        seqan3::search_cfg::max_error_total{seqan3::search_cfg::error_count{1}} | seqan3::search_cfg::hit_all_best{};
 
    seqan3::debug_stream << "-----------------\n";
 
    for (auto && hit : search(query, index, search_config))
    {
        size_t start = hit.reference_begin_position() ? hit.reference_begin_position() - 1 : 0;
        std::span text_view{std::data(text[hit.reference_id()]) + start, query.size() + 1};
 
        for (auto && res : align_pairwise(std::tie(text_view, query), align_config))
        {
            auto && [aligned_database, aligned_query] = res.alignment();
            seqan3::debug_stream << "score:    " << res.score() << '\n';
            seqan3::debug_stream << "database: " << aligned_database << '\n';
            seqan3::debug_stream << "query:    " << aligned_query << '\n';
            seqan3::debug_stream << "=============\n";
        }
    }
}
 
int main()
{
    run_text_single();
    seqan3::debug_stream << '\n';
    run_text_collection();
}

Expected output:

Searching all best hits allowing for 1 error in a single text
-----------------
score:    0
database: GCT
query:    GCT
=============
score:    0
database: GCT
query:    GCT
=============
score:    0
database: GCT
query:    GCT
=============
 
Searching all best hits allowing for 1 error in a text collection
-----------------
score:    0
database: GCT
query:    GCT
=============
score:    0
database: GCT
query:    GCT
=============
score:    0
database: GCT
query:    GCT
=============

Difficulty	Moderate
Duration	60 Minutes
Prerequisite tutorials	Ranges (recommended), Pairwise Alignment (only for the last assignment)
Recommended reading	FM-Index paper, FM-Index on Wikipedia, Bi-FM-Index paper
