SeqAn3  3.0.1
The Modern C++ library for sequence analysis.
Indexing and searching with SeqAn

Learning Objective:
In this tutorial you will learn how to construct an index and conduct searches.

DifficultyModerate
Duration60 Minutes
Prerequisite tutorialsRanges (Recommended)
Pairwise Alignment (only for last assignment)
Recommended readingFM-Index paper
FM-Index on Wikipedia
Bi-FM-Index paper

Introduction

Exact and approximate string matching is a common problem in bioinformatics, e.g. in read mapping. Usually, we want to search for a match of one or many small sequences (queries) in a database consisting of one or more large sequences (references). Trivial approaches such as on-line search fail at this task for large data due to prohibitive run times.

The general solution is to use a data structure called index. An index is built over the reference and only needs to be computed once for a given data set. It is used to speed up the search in the reference and can be re-used by storing it to disk and loading it again without recomputation.

There are two groups of indices: hash tables and suffix-based indices. SeqAn implements the FM-Index and Bi-FM-Index as suffix-based indices. While the Bi-FM-Index needs more space (a factor of about 2), it allows faster searches.

Given an index, SeqAn will choose the best search strategy for you. Since very different algorithms may be selected internally depending on the configuration, it is advisable to do benchmarks with your application. A rule of thumb is to use the Bi-FM-Index when allowing more than 2 errors.

This tutorial will show you how to use the seqan3::fm_index and seqan3::bi_fm_index to create indices and how to search them efficiently using seqan3::search.

Capabilities

With this module you can:

  • Create, store and load (Bi)-FM-Indices
  • Search for exact matches
  • Search for approximate matches (allowing substitutions and indels)

The results of the search can be passed on to other modules, e.g. to create an alignment.

Terminology

Reference

A reference is the data you want to search in, e.g. a genome or protein database.

Query

A query is the data you want to search for in the reference, e.g. a read or an amino acid sequence.

Index

An index is a data structure built over the reference that allows fast searches.

FM-Index

The full-text minute index (FM-Index) is an index that is similar to a suffix tree, but much smaller. It is used by most state-of-the-art read mappers and aligners. You can find more information on FM-Indicies in the original publication and on Wikipedia.

Bi-FM-Index

The bidirectional FM-Index (Bi-FM-Index) is an extension of the FM-Index that enables faster searches, especially when allowing multiple errors. But it uses almost twice the amount of memory the FM-Index uses. You can find more information on Bi-FM-Indicies here.

Example

Constructing a (Bi-)FM-Index is very simple:

#include <seqan3/search/fm_index/bi_fm_index.hpp> // for using the bi_fm_index
#include <seqan3/search/fm_index/fm_index.hpp> // for using the fm_index
int main()
{
std::string text{"Garfield the fat cat without a hat."};
seqan3::fm_index index{text}; // unidirectional index on single text
seqan3::bi_fm_index bi_index{text}; // bidirectional index on single text
}

You can also index text collections (e.g. genomes with multiple chromosomes or protein databases):

std::vector<std::string> text{{"Garfield the fat cat without a hat."},
{"He is infinite, he is eternal."},
{"Yet another text I have to think about."}};
seqan3::fm_index index{text};
seqan3::bi_fm_index bi_index{text};

The indices can also be stored and loaded from disk by using cereal.

#include <fstream> // for writing/reading files
#include <cereal/archives/binary.hpp> // for storing/loading indices via cereal
std::string text{"Garfield the fat cat without a hat."};
seqan3::fm_index index{text};
{
std::ofstream os{"index.file", std::ios::binary};
cereal::BinaryOutputArchive oarchive{os};
oarchive(index);
}
// we need to tell the index that we work on a single text and a `char` alphabet before loading
{
std::ifstream is{"index.file", std::ios::binary};
cereal::BinaryInputArchive iarchive{is};
iarchive(index);
}

Note that in contrast to the construction via a given text, the template cannot be deduced by the compiler when using the default constructor so you have to provide template arguments.

Assignment 1

You are given the text

dna4_vector text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};

Create a seqan3::fm_index over the reference, store the index and load the index into a new seqan3::fm_index object. Print whether the indices are identical or differ.

Solution

#include <fstream>
#include <cereal/archives/binary.hpp>
using seqan3::operator""_dna4;
int main()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::fm_index index{text};
{
std::ofstream os{"index.file", std::ios::binary};
cereal::BinaryOutputArchive oarchive{os};
oarchive(index);
}
// we need to tell the index that we work on a single text and a `dna4` alphabet before loading
{
std::ifstream is{"index.file", std::ios::binary};
cereal::BinaryInputArchive iarchive{is};
iarchive(index2);
}
if (index == index2)
std::cout << "The indices are identical!\n";
else
std::cout << "The indices differ!\n";
}

Expected output:

The indices are identical!

Search

Using an index, we can now conduct searches for a given query. In this part we will learn how to search exactly, allow substitutions and indels, and how to configure what kind of results we want, e.g. all results vs. only the best result.

Terminology

Match

If a query can be found in the reference with the specified error configuration, we denote this result as a match. If the query matches the reference without any errors, we call this an exact match. If a match contains errors, it's an approximate match.

Searching for exact matches

We can search for all exact matches using seqan3::search:

#include <seqan3/core/debug_stream.hpp> // pretty printing
#include <seqan3/search/algorithm/search.hpp> // this already includes (bi)_fm_index
using namespace std::string_literals; // for using the ""s string literal
int main()
{
std::string text{"Garfield the fat cat without a hat."};
seqan3::fm_index index{text};
seqan3::debug_stream << search("cat"s, index) << '\n'; // [17]
}

When using a single text, the search will return a std::vector of the begin positions of matches in the reference.

For text collections, a std::vector over std::tuple where the first element represents the index of the text in the collection and the second element represents the position within the text is returned.

You can also pass multiple queries at the same time:

std::string text{"Garfield the fat cat without a hat."};
seqan3::fm_index index{text};
std::vector<std::string> query{"cat"s, "hat"s};
seqan3::debug_stream << search(query, index) << '\n'; // [[17],[31]]

The returned result will be a vector of individual results. Each element refers to the respective query.

Assignment 2

Search for all exact occurrences of GCT in the text from assignment 1.
Print the number of hits and the positions in a ascending ordering.
Do the same for the following text collection:

std::vector<dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
"ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
"GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};

Solution

using seqan3::operator""_dna4;
void run_text_single()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::fm_index index{text};
auto results = search("GCT"_dna4, index);
std::ranges::sort(results);
seqan3::debug_stream << "===== Running on a single text =====\n";
seqan3::debug_stream << "There are " << results.size() << " hits.\n";
seqan3::debug_stream << "The positions are " << results << '\n';
}
void run_text_collection()
{
std::vector<seqan3::dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
"ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
"GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::fm_index index{text};
auto results = search("GCT"_dna4, index);
std::ranges::sort(results);
seqan3::debug_stream << "===== Running on a text collection =====\n";
seqan3::debug_stream << "There are " << results.size() << " hits.\n";
seqan3::debug_stream << "The positions are " << results << '\n';
}
int main()
{
run_text_single();
run_text_collection();
}

Expected output:

===== Running on a single text =====
There are 3 hits.
The positions are [1,41,77]
===== Running on a text collection =====
There are 3 hits.
The positions are [(0,1),(1,9),(2,16)]

Searching for approximate matches

If you want to allow for errors in your query, you need to configure the approximate search with a seqan3::search_cfg::max_error or seqan3::search_cfg::max_error_rate object. These are constructed with four parameters:

With absolute numbers (seqan3::search_cfg::max_error):

Use seqan3::search_cfg::max_error_rate with the same elements to define error rates $\in[0,1]$.

Note
When using >= 2 errors it is advisable to use a Bi-FM-Index since searches will be faster.

To search for either 1 insertion or 1 deletion you can write:

std::string text{"Garfield the fat cat without a hat."};
seqan3::fm_index index{text};
seqan3::debug_stream << search("cat"s, index, cfg) << '\n'; // [14,17,18,32]

Note that the error amount/rates do not have to sum up to the total of errors allowed:

Allows 2 errors which can be any combination of 2 substituions, 1 insertion and 1 deletion.

Defining only the total will set all error types to this value.

Defining any of the error types, but no total, will set the total to the sum of all error types; if a total is set, it will act as an upper bound.

Assignment 3

Search for all occurrences of GCT in the text from assignment 1.

Allow up to 1 substitution and print the matching substring of the reference.

Hint The search will give you positions in the text. To access the corresponding subrange of the text you can use std::span:

#include <seqan3/std/span>
int main()
{
std::string text{"Garfield the fat cat without a hat."};
size_t start{2};
size_t span{3};
std::span text_view{std::data(text) + start, span}; // represent interval [2, 4]
seqan3::debug_stream << text_view << '\n'; // Prints "rfi"
}

Solution

#include <seqan3/std/span>
using seqan3::operator""_dna4;
int main()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::fm_index index{text};
auto results = search("GCT"_dna4, index, cfg);
std::ranges::sort(results);
seqan3::debug_stream << "There are " << results.size() << " hits.\n";
for (auto && pos : results)
{
seqan3::debug_stream << "At position " << pos << ": "
<< std::span{std::data(text) + pos, 3}
<< '\n';
}
}

Expected output:

There are 12 hits.
At position 1: GCT
At position 5: TCT
At position 12: GAT
At position 23: GCC
At position 36: GAT
At position 41: GCT
At position 57: ACT
At position 62: GCC
At position 75: GCG
At position 77: GCT
At position 83: GCA
At position 85: ACT

Search modes

Besides the error configuration, you can define what kind of hits should be reported:

The mode is appended to the error configuration by using the |-operator:

The strata mode needs an additional parameter:

If the best hit had an edit distance of 1, the strata mode would report all hits with up to an edit distance of 3. Since in this example the total error number is set to 2, all hits with 1 or 2 errors would be reported.

Assignment 4

Search for all occurrences of GCT in the text from assignment 1.
Allow up to 1 error of any type and print the number of hits for each search mode (use mode::strata{1}).

Solution

using seqan3::operator""_dna4;
int main()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::dna4_vector query{"GCT"_dna4};
seqan3::fm_index index{text};
seqan3::debug_stream << "Searching all hits\n";
auto results_all = search(query, index, cfg_all);
seqan3::debug_stream << "There are " << results_all.size() << " hits.\n";
seqan3::debug_stream << "Searching all best hits\n";
auto results_all_best = search(query, index, cfg_all_best);
seqan3::debug_stream << "There are " << results_all_best.size() << " hits.\n";
seqan3::debug_stream << "Searching best hit\n";
auto results_best = search(query, index, cfg_best);
seqan3::debug_stream << "There is " << results_best.size() << " hit.\n";
seqan3::debug_stream << "Searching all hits in the 1-stratum\n";
auto results_strata = search(query, index, cfg_strata);
seqan3::debug_stream << "There are " << results_strata.size() << " hits.\n";
}

Expected output:

Searching all hits
There are 25 hits.
Searching all best hits
There are 3 hits.
Searching best hit
There is 1 hit.
Searching all hits in the 1-stratum
There are 25 hits.

One last exercise

Assignment 5

Search for all occurrences of GCT in the text from assignment 1.
Allow up to 1 error of any type search for all occurrences in the all_best mode.
Align the query to each of the found positions in the genome and print the score and alignment.
BONUS
Do the same for the text collection from assignment 2.
Hint Similarly to assignment 3, you have to extract the subrange of the text in order to align against it:

// std::span returnes the subrange [start, start + span - 1], i.e. span many elements beginning at start
// For single texts.
std::span text_view{std::data(text) + start, span};
// For text collections. idx is the current index of the text collection returned by the search.
std::span text_view{std::data(text[idx]) + start, span};

Solution

#include <seqan3/std/span>
using seqan3::operator""_dna4;
void run_text_single()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::dna4_vector query{"GCT"_dna4};
seqan3::fm_index index{text};
seqan3::debug_stream << "Searching all best hits allowing for 1 error in a single text\n";
seqan3::align_cfg::aligned_ends{seqan3::free_ends_first} |
seqan3::align_cfg::result{seqan3::with_alignment};
auto results = search(query, index, search_config);
seqan3::debug_stream << "There are " << results.size() << " hits.\n";
seqan3::debug_stream << "-----------------\n";
for (auto && pos : results)
{
size_t start = pos ? pos - 1 : 0;
std::span text_view{std::data(text) + start, query.size() + 1};
for (auto && res : align_pairwise(std::tie(text_view, query), align_config))
{
auto && [aligned_database, aligned_query] = res.alignment();
seqan3::debug_stream << "score: " << res.score() << '\n';
seqan3::debug_stream << "database: " << aligned_database << '\n';
seqan3::debug_stream << "query: " << aligned_query << '\n';
seqan3::debug_stream << "=============\n";
}
}
}
void run_text_collection()
{
std::vector<seqan3::dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
"ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
"GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::dna4_vector query{"GCT"_dna4};
seqan3::fm_index index{text};
seqan3::debug_stream << "Searching all best hits allowing for 1 error in a text collection\n";
seqan3::align_cfg::aligned_ends{seqan3::free_ends_first} |
seqan3::align_cfg::result{seqan3::with_alignment};
auto results = search(query, index, search_config);
seqan3::debug_stream << "There are " << results.size() << " hits.\n";
seqan3::debug_stream << "-----------------\n";
for (auto [idx, pos] : results)
{
size_t start = pos ? pos - 1 : 0;
std::span text_view{std::data(text[idx]) + start, query.size() + 1};
for (auto && res : align_pairwise(std::tie(text_view, query), align_config))
{
auto && [aligned_database, aligned_query] = res.alignment();
seqan3::debug_stream << "score: " << res.score() << '\n';
seqan3::debug_stream << "database: " << aligned_database << '\n';
seqan3::debug_stream << "query: " << aligned_query << '\n';
seqan3::debug_stream << "=============\n";
}
}
}
int main()
{
run_text_single();
run_text_collection();
}

Expected output:

Searching all best hits allowing for 1 error in a single text
There are 3 hits.
-----------------
score: 0
database: GCT
query: GCT
=============
score: 0
database: GCT
query: GCT
=============
score: 0
database: GCT
query: GCT
=============
Searching all best hits allowing for 1 error in a text collection
There are 3 hits.
-----------------
score: 0
database: GCT
query: GCT
=============
score: 0
database: GCT
query: GCT
=============
score: 0
database: GCT
query: GCT
=============

debug_stream.hpp
Provides seqan3::debug_stream and related types.
span
Provides std::span from the C++20 standard library.
fstream
seqan3::search
auto search(queries_t &&queries, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration)
Search a query or a range of queries in an index.
Definition: search.hpp:167
std::string
align_pairwise.hpp
Provides pairwise alignment function.
seqan3::fm_index
The SeqAn FM Index.
Definition: fm_index.hpp:134
std::vector< std::string >
seqan3::search_cfg::strata
Configuration element to receive all hits with the best number of errors plus the strata value....
Definition: mode.hpp:57
seqan3::search_cfg::deletion
A strong type of underlying type uint8_t or double that represents the number or rate of deletions.
Definition: max_error_common.hpp:130
std::search
T search(T... args)
seqan3::align_pairwise
constexpr auto align_pairwise(sequence_t &&seq, alignment_config_t const &config)
Computes the pairwise alignment for a pair of sequences or a range over sequence pairs.
Definition: align_pairwise.hpp:139
seqan3::align_cfg::result
Sets the result of the alignment computation.
Definition: align_config_result.hpp:136
bi_fm_index.hpp
Provides the bidirectional seqan3::bi_fm_index.
seqan3::bi_fm_index
The SeqAn Bidirectional FM Index.
Definition: bi_fm_index.hpp:64
std::tie
T tie(T... args)
std::cout
seqan3::configuration
Collection of elements to configure an algorithm.
Definition: configuration.hpp:81
std::ofstream
search.hpp
Provides the public interface for search algorithms.
seqan3::align_cfg::aligned_ends
The configuration for aligned sequence ends.
Definition: align_config_aligned_ends.hpp:547
all.hpp
Meta-header for the alignment configuration module .
seqan3::search_cfg::all
constexpr detail::search_mode_all all
Configuration element to receive all hits within the error bounds.
Definition: mode.hpp:43
seqan3::search_cfg::substitution
A strong type of underlying type uint8_t or double that represents the number or rate of substitution...
Definition: max_error_common.hpp:64
seqan3::search_cfg::total
A strong type of underlying type uint8_t or double that represents the number or rate of total errors...
Definition: max_error_common.hpp:30
seqan3::search_cfg::max_error
A configuration element for the maximum number of errors across all error types (mismatches,...
Definition: max_error.hpp:43
seqan3::search_cfg::best
constexpr detail::search_mode_best best
Configuration element to receive one best hit (with the lowest number of errors).
Definition: mode.hpp:49
seqan3::search_cfg::insertion
A strong type of underlying type uint8_t or double that represents the number or rate of insertions.
Definition: max_error_common.hpp:97
seqan3::debug_stream
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:39
seqan3::search_cfg::mode
Configuration element to determine the search mode.
Definition: mode.hpp:86
seqan3::align_cfg::edit
constexpr configuration edit
Shortcut for edit distance configuration.
Definition: align_config_edit.hpp:52
fm_index.hpp
Provides the unidirectional seqan3::fm_index.
seqan3::search_cfg::all_best
constexpr detail::search_mode_all_best all_best
Configuration element to receive all hits within the lowest number of errors.
Definition: mode.hpp:46
std::data
T data(T... args)
std::ifstream