Introduction

Exact and approximate string matching is a common problem in bioinformatics, e.g. in read mapping. Usually, we want to search for a match of one or many small sequences (queries) in a database consisting of one or more large sequences (references). Trivial approaches such as on-line search fail at this task for large data due to prohibitive run times.

The general solution is to use a data structure called index. An index is built over the reference and only needs to be computed once for a given data set. It is used to speed up the search in the reference and can be re-used by storing it to disk and loading it again without recomputation.

There are two groups of indices: hash tables and suffix-based indices. SeqAn3 implements the FM-Index and Bi-FM-Index as suffix-based indices. While the Bi-FM-Index needs more space (a factor of about 2), it allows faster searches.

Given an index, SeqAn3 will choose the best search strategy for you. Since very different algorithms may be selected internally depending on the configuration, it is advisable to do benchmarks with your application. A rule of thumb is to use the Bi-FM-Index when allowing more than 2 errors.

This tutorial will show you how to use the seqan3::fm_index and seqan3::bi_fm_index to create indices and how to search them efficiently using seqan3::search.

Capabilities

With this module you can:

Create, store and load (Bi)-FM-Indices
Search for exact matches
Search for approximate matches (allowing substitutions and indels)

The results of the search can be passed on to other modules, e.g. to create an alignment.

Terminology

Reference

A reference is the data you want to search in, e.g. a genome or protein database.

Query

A query is the data you want to search for in the reference, e.g. a read or an amino acid sequence.

Index

An index is a data structure built over the reference that allows fast searches.

FM-Index

The full-text minute index (FM-Index) is an index that is similar to a suffix tree, but much smaller. It is used by most state-of-the-art read mappers and aligners. You can find more information on FM-Indicies in the original publication and on Wikipedia.

Bi-FM-Index

The bidirectional FM-Index (Bi-FM-Index) is an extension of the FM-Index that enables faster searches, especially when allowing multiple errors. But it uses almost twice the amount of memory the FM-Index uses. You can find more information on Bi-FM-Indicies here.

Example

Constructing a (Bi-)FM-Index is very simple:

#include <seqan3/search/fm_index/bi_fm_index.hpp>   // for using the bi_fm_index
#include <seqan3/search/fm_index/fm_index.hpp>      // for using the fm_index
using namespace seqan3;
int main()
{
    std::string text{"Garfield the fat cat without a hat."};
    fm_index index{text};        // unidirectional index on single text
    bi_fm_index bi_index{text};  // bidirectional index on single text
}

You can also index text collections (e.g. genomes with multiple chromosomes or protein databases):

std::vector<std::string> text{{"Garfield the fat cat without a hat."},
                              {"He is infinite, he is eternal."},
                              {"Yet another text I have to think about."}};
fm_index index{text};
bi_fm_index bi_index{text};

The indices can also be stored and loaded from disk by using cereal.

#include <fstream>                                  // for writing/reading files
#include <cereal/archives/binary.hpp>               // for storing/loading indices via cereal
std::string text{"Garfield the fat cat without a hat."};
fm_index index{text};
{
    std::ofstream os{"index.file", std::ios::binary};
    cereal::BinaryOutputArchive oarchive{os};
    oarchive(index);
}

fm_index<false> index; // we need to tell the index that we work on a text collection or not before loading
{
    std::ifstream is{"index.file", std::ios::binary};
    cereal::BinaryInputArchive iarchive{is};
    iarchive(index);
}

Note that in contrast to the construction via a given text, the template cannot be deduced by the compiler when using the default constructor so you have to provide template arguments.

Assignment 1

You are given the text

dna4_vector text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};

Create a seqan3::fm_index over the reference, store the index and load the index into a new seqan3::fm_index object. Print whether the indices are identical or differ.

Solution

#include <fstream>
#include <cereal/archives/binary.hpp>
#include <seqan3/search/fm_index/fm_index.hpp>
using namespace seqan3;
int main()
{
    dna4_vector text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    fm_index index{text};
    {
        std::ofstream os{"index.file", std::ios::binary};
        cereal::BinaryOutputArchive oarchive{os};
        oarchive(index);
    }
    fm_index<false> index2; // we need to tell the index that we work on a text collection or not before loading
    {
        std::ifstream is{"index.file", std::ios::binary};
        cereal::BinaryInputArchive iarchive{is};
        iarchive(index2);
    }
    if (index == index2)
        std::cout << "The indices are identical!\n";
    else
        std::cout << "The indices differ!\n";
}

Expected output:

The indices are identical!

Search

Using an index, we can now conduct searches for a given query. In this part we will learn how to search exactly, allow substitutions and indels, and how to configure what kind of results we want, e.g. all results vs. only the best result.

Terminology

Match

If a query can be found in the reference with the specified error configuration, we denote this result as a match. If the query matches the reference without any errors, we call this an exact match. If a match contains errors, it's an approximate match.

Searching for exact matches

We can search for all exact matches using seqan3::search:

#include <seqan3/core/debug_stream.hpp>       // pretty printing
#include <seqan3/search/algorithm/search.hpp> // this already includes (bi)_fm_index
using namespace seqan3;
using namespace std::string_literals; // for using the ""s string literal
int main()
{
    std::string text{"Garfield the fat cat without a hat."};
    fm_index index{text};
    debug_stream << search("cat"s, index) << '\n'; // [17]
}

When using a single text, the search will return a std::vector of the begin positions of matches in the reference.

For text collections, a std::vector over std::tuple where the first element represents the index of the text in the collection and the second element represents the position within the text is returned.

You can also pass multiple queries at the same time:

std::string text{"Garfield the fat cat without a hat."};
fm_index index{text};
std::vector<std::string> query{"cat"s, "hat"s};
debug_stream << search(query, index) << '\n'; // [[17],[31]]

The returned result will be a vector of individual results. Each element refers to the respective query.

Assignment 2

Search for all exact occurrences of GCT in the text from assignment 1.
Print the number of hits and the positions in a ascending ordering.
Do the same for the following text collection:

std::vector<dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
                              "ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
                              "GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};

Solution

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/algorithm/search.hpp>
using namespace seqan3;
void run_text_single()
{
    dna4_vector text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    fm_index index{text};
    auto results = search("GCT"_dna4, index);
    std::ranges::sort(results);
    debug_stream << "=====   Running on a single text   =====\n";
    debug_stream << "There are " << results.size() << " hits.\n";
    debug_stream << "The positions are " << results << '\n';
}
void run_text_collection()
{
    std::vector<dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
                                  "ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
                                  "GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    fm_index index{text};
    auto results = search("GCT"_dna4, index);
    std::ranges::sort(results);
    debug_stream << "===== Running on a text collection =====\n";
    debug_stream << "There are " << results.size() << " hits.\n";
    debug_stream << "The positions are " << results << '\n';
}
int main()
{
   run_text_single();
   debug_stream << '\n';
   run_text_collection();
}

Expected output:

=====   Running on a single text   =====
There are 3 hits.
The positions are [1,41,77]
===== Running on a text collection =====
There are 3 hits.
The positions are [(0,1),(1,9),(2,16)]

Searching for approximate matches

If you want to allow for errors in your query, you need to configure the approximate search with a seqan3::search_cfg::max_error or seqan3::search_cfg::max_error_rate object. These are constructed with four parameters:

With absolute numbers (seqan3::search_cfg::max_error):

seqan3::search_cfg::total: Maximum number of total errors
seqan3::search_cfg::substitution: Maximum number of substitutions
seqan3::search_cfg::insertion: Maximum number of insertions
seqan3::search_cfg::deletion: Maximum number of deletions

Use seqan3::search_cfg::max_error_rate with the same elements to define error rates $\in[0,1]$ .

Note: When using >= 2 errors it is advisable to use a Bi-FM-Index since searches will be faster.

To search for either 1 insertion or 1 deletion you can write:

std::string text{"Garfield the fat cat without a hat."};
fm_index index{text};
configuration const cfg = search_cfg::max_error{search_cfg::total{1},
                                                search_cfg::substitution{0},
                                                search_cfg::insertion{1},
                                                search_cfg::deletion{1}};
debug_stream << search("cat"s, index, cfg) << '\n'; // [14,17,18,32]

Note that the error amount/rates do not have to sum up to the total of errors allowed:

configuration const cfg = search_cfg::max_error{search_cfg::total{2},
                                                search_cfg::substitution{2},
                                                search_cfg::insertion{1},
                                                search_cfg::deletion{1}};

Allows 2 errors which can be any combination of 2 substituions, 1 insertion and 1 deletion.

Defining only the total will set all error types to this value.

Defining any of the error types, but no total, will set the total to the sum of all error types; if a total is set, it will act as an upper bound.

Assignment 3

Search for all occurrences of GCT in the text from assignment 1.

Allow up to 1 substitution and print the matching substring of the reference.

Hint

The search will give you positions in the text. To access the corresponding subrange of the text you can use std::span:

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/std/span>
using namespace seqan3;
int main()
{
    std::string text{"Garfield the fat cat without a hat."};
    size_t start{2};
    size_t span{3};
    std::span text_view{std::data(text) + start, span}; // represent interval [2, 4]
    debug_stream << text_view << '\n'; // Prints "rfi"
}

Solution

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/algorithm/search.hpp>
#include <seqan3/std/span>
using namespace seqan3;
int main()
{
    dna4_vector text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    fm_index index{text};
    configuration const cfg = search_cfg::max_error{search_cfg::substitution{1}};
    auto results = search("GCT"_dna4, index, cfg);
    std::ranges::sort(results);
    debug_stream << "There are " << results.size() << " hits.\n";
    for (auto && pos : results)
    {
        debug_stream << "At position " << pos << ": "
                     << std::span{std::data(text) + pos, 3}
                     << '\n';
    }
}

Expected output:

There are 12 hits.
At position 1: GCT
At position 5: TCT
At position 12: GAT
At position 23: GCC
At position 36: GAT
At position 41: GCT
At position 57: ACT
At position 62: GCC
At position 75: GCG
At position 77: GCT
At position 83: GCA
At position 85: ACT

Search modes

Besides the error configuration, you can define what kind of hits should be reported:

seqan3::search_cfg::all: Report all hits that satisfy the (approximate) search.
seqan3::search_cfg::best: Report the best hit, i.e. the first hit with the lowest edit distance.
seqan3::search_cfg::all_best: Report all hits with the lowest edit distance.
seqan3::search_cfg::strata: best+x mode. Report all hits within the x-neighbourhood of the best hit.

The mode is appended to the error configuration by using the |-operator:

configuration const cfg = search_cfg::max_error{search_cfg::total{1},
                                                search_cfg::substitution{0},
                                                search_cfg::insertion{1},
                                                search_cfg::deletion{1}} |
                          search_cfg::mode{search_cfg::best};

The strata mode needs an additional parameter:

configuration const cfg = search_cfg::max_error{search_cfg::total{2},
                                                search_cfg::substitution{0},
                                                search_cfg::insertion{1},
                                                search_cfg::deletion{1}} |
                          search_cfg::mode{search_cfg::strata{2}};

If the best hit had an edit distance of 1, the strata mode would report all hits with up to an edit distance of 3. Since in this example the total error number is set to 2, all hits with 1 or 2 errors would be reported.

Assignment 4

Search for all occurrences of GCT in the text from assignment 1.
Allow up to 1 error of any type and print the number of hits for each search mode (use mode::strata{1}).

Solution

#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/algorithm/search.hpp>
using namespace seqan3;
int main()
{
    dna4_vector text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    dna4_vector query{"GCT"_dna4};
    fm_index index{text};
    debug_stream << "Searching all hits\n";
    configuration const cfg_all = search_cfg::max_error{search_cfg::total{1}} |
                                  search_cfg::mode{search_cfg::all};
    auto results_all = search(query, index, cfg_all);
    debug_stream << "There are " << results_all.size() << " hits.\n";
    debug_stream << "Searching all best hits\n";
    configuration const cfg_all_best = search_cfg::max_error{search_cfg::total{1}} |
                                       search_cfg::mode{search_cfg::all_best};
    auto results_all_best = search(query, index, cfg_all_best);
    debug_stream << "There are " << results_all_best.size() << " hits.\n";
    debug_stream << "Searching best hit\n";
    configuration const cfg_best = search_cfg::max_error{search_cfg::total{1}} |
                                   search_cfg::mode{search_cfg::best};
    auto results_best = search(query, index, cfg_best);
    debug_stream << "There is " << results_best.size() << " hit.\n";
    debug_stream << "Searching all hits in the 1-stratum\n";
    configuration const cfg_strata = search_cfg::max_error{search_cfg::total{1}} |
                                     search_cfg::mode{search_cfg::strata{1}};
    auto results_strata = search(query, index, cfg_strata);
    debug_stream << "There are " << results_strata.size() << " hits.\n";
}

Expected output:

Searching all hits
There are 25 hits.
Searching all best hits
There are 3 hits.
Searching best hit
There is 1 hit.
Searching all hits in the 1-stratum
There are 25 hits.

One last exercise

Assignment 5

Search for all occurrences of GCT in the text from assignment 1.
Allow up to 1 error of any type search for all occurrences in the all_best mode.
Align the query to each of the found positions in the genome and print the score and alignment.
BONUS
Do the same for the text collection from assignment 2.

Hint

Similarly to assignment 3, you have to extract the subrange of the text in order to align against it:

// std::span returnes the subrange [start, start + span - 1], i.e. span many elements beginning at start
// For single texts.
std::span text_view{std::data(text) + start, span};
// For text collections. idx is the current index of the text collection returned by the search.
std::span text_view{std::data(text[idx]) + start, span};

Solution

#include <seqan3/alignment/configuration/all.hpp>
#include <seqan3/alignment/pairwise/align_pairwise.hpp>
#include <seqan3/core/debug_stream.hpp>
#include <seqan3/search/algorithm/search.hpp>
#include <seqan3/std/span>
using namespace seqan3;
void run_text_single()
{
    dna4_vector text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    dna4_vector query{"GCT"_dna4};
    fm_index index{text};
    debug_stream << "Searching all best hits allowing for 1 error in a single text\n";
    configuration const search_config = search_cfg::max_error{search_cfg::total{1}} |
                                        search_cfg::mode{search_cfg::all_best};
    configuration const align_config = align_cfg::edit |
                                       align_cfg::aligned_ends{free_ends_first} |
                                       align_cfg::result{with_alignment};
    auto results = search(query, index, search_config);
    debug_stream << "There are " << results.size() << " hits.\n";
    debug_stream << "-----------------\n";
    for (auto && pos : results)
    {
        size_t start = pos ? pos - 1 : 0;
        std::span text_view{std::data(text) + start, query.size() + 1};
        for (auto && res : align_pairwise(std::tie(text_view, query), align_config))
        {
            auto && [aligned_database, aligned_query] = res.alignment();
            debug_stream << "score:    " << res.score() << '\n';
            debug_stream << "database: " << aligned_database << '\n';
            debug_stream << "query:    "  << aligned_query << '\n';
            debug_stream << "=============\n";
        }
    }
}
void run_text_collection()
{
    std::vector<dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
                                  "ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
                                  "GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
    dna4_vector query{"GCT"_dna4};
    fm_index index{text};
    debug_stream << "Searching all best hits allowing for 1 error in a text collection\n";
    configuration const search_config = search_cfg::max_error{search_cfg::total{1}} |
                                        search_cfg::mode{search_cfg::all_best};
    configuration const align_config = align_cfg::edit |
                                       align_cfg::aligned_ends{free_ends_first} |
                                       align_cfg::result{with_alignment};
    auto results = search(query, index, search_config);
    debug_stream << "There are " << results.size() << " hits.\n";
    debug_stream << "-----------------\n";
    for (auto [idx, pos] : results)
    {
        size_t start = pos ? pos - 1 : 0;
        std::span text_view{std::data(text[idx]) + start, query.size() + 1};
        for (auto && res : align_pairwise(std::tie(text_view, query), align_config))
        {
            auto && [aligned_database, aligned_query] = res.alignment();
            debug_stream << "score:    " << res.score() << '\n';
            debug_stream << "database: " << aligned_database << '\n';
            debug_stream << "query:    "  << aligned_query << '\n';
            debug_stream << "=============\n";
        }
    }
}
int main()
{
   run_text_single();
   debug_stream << '\n';
   run_text_collection();
}

Expected output:

Searching all best hits allowing for 1 error in a single text
There are 3 hits.
-----------------
score:    0
database: GCT
query:    GCT
=============
score:    0
database: GCT
query:    GCT
=============
score:    0
database: GCT
query:    GCT
=============
Searching all best hits allowing for 1 error in a text collection
There are 3 hits.
-----------------
score:    0
database: GCT
query:    GCT
=============
score:    0
database: GCT
query:    GCT
=============
score:    0
database: GCT
query:    GCT
=============

Difficulty	Moderate
Duration	60 Minutes
Prerequisite tutorials	Ranges (Recommended) Pairwise Alignment (only for last assignment)
Recommended reading	FM-Index paper FM-Index on Wikipedia Bi-FM-Index paper

Table of Contents

Introduction

Capabilities

Terminology

Reference

Query

Index

FM-Index

Bi-FM-Index

Example

Assignment 1

Search

Terminology

Match

Searching for exact matches

Assignment 2

Searching for approximate matches

Assignment 3

Search modes

Assignment 4

One last exercise

Assignment 5