SeqAn3  3.0.2
The Modern C++ library for sequence analysis.
Indexing and searching with SeqAn

Learning Objective:
In this tutorial you will learn how to construct an index and conduct searches.

DifficultyModerate
Duration60 Minutes
Prerequisite tutorialsRanges (Recommended)
Pairwise Alignment (only for last assignment)
Recommended readingFM-Index paper
FM-Index on Wikipedia
Bi-FM-Index paper

Introduction

Exact and approximate string matching is a common problem in bioinformatics, e.g. in read mapping. Usually, we want to search for a hit of one or many small sequences (queries) in a database consisting of one or more large sequences (references). Trivial approaches such as on-line search fail at this task for large data due to prohibitive run times.

The general solution is to use a data structure called index. An index is built over the reference and only needs to be computed once for a given data set. It is used to speed up the search in the reference and can be re-used by storing it to disk and loading it again without recomputation.

There are two groups of indices: hash tables and suffix-based indices. SeqAn implements the FM-Index and Bi-FM-Index as suffix-based indices. While the Bi-FM-Index needs more space (a factor of about 2), it allows faster searches.

Given an index, SeqAn will choose the best search strategy for you. Since very different algorithms may be selected internally depending on the configuration, it is advisable to do benchmarks with your application. A rule of thumb is to use the Bi-FM-Index when allowing more than 2 errors.

This tutorial will show you how to use the seqan3::fm_index and seqan3::bi_fm_index to create indices and how to search them efficiently using seqan3::search.

Capabilities

With this module you can:

  • Create, store and load (Bi)-FM-Indices
  • Search for exact hits
  • Search for approximate hits (allowing substitutions and indels)

The results of the search can be passed on to other modules, e.g. to create an alignment.

Terminology

Reference

A reference is the data you want to search in, e.g. a genome or protein database.

Query

A query is the data you want to search for in the reference, e.g. a read or an amino acid sequence.

Index

An index is a data structure built over the reference that allows fast searches.

FM-Index

The full-text minute index (FM-Index) is an index that is similar to a suffix tree, but much smaller. It is used by most state-of-the-art read mappers and aligners. You can find more information on FM-Indicies in the original publication and on Wikipedia.

Bi-FM-Index

The bidirectional FM-Index (Bi-FM-Index) is an extension of the FM-Index that enables faster searches, especially when allowing multiple errors. But it uses almost twice the amount of memory the FM-Index uses. You can find more information on Bi-FM-Indicies here.

Example

Constructing a (Bi-)FM-Index is very simple:

#include <seqan3/search/fm_index/bi_fm_index.hpp> // for using the bi_fm_index
#include <seqan3/search/fm_index/fm_index.hpp> // for using the fm_index
int main()
{
std::string text{"Garfield the fat cat without a hat."};
seqan3::fm_index index{text}; // unidirectional index on single text
seqan3::bi_fm_index bi_index{text}; // bidirectional index on single text
}

You can also index text collections (e.g. genomes with multiple chromosomes or protein databases):

std::vector<std::string> text{{"Garfield the fat cat without a hat."},
{"He is infinite, he is eternal."},
{"Yet another text I have to think about."}};
seqan3::fm_index index{text};
seqan3::bi_fm_index bi_index{text};

The indices can also be stored and loaded from disk by using cereal.

#include <fstream> // for writing/reading files
#include <cereal/archives/binary.hpp> // for storing/loading indices via cereal
std::string text{"Garfield the fat cat without a hat."};
seqan3::fm_index index{text};
{
std::ofstream os{"index.file", std::ios::binary};
cereal::BinaryOutputArchive oarchive{os};
oarchive(index);
}
// we need to tell the index that we work on a single text and a `char` alphabet before loading
{
std::ifstream is{"index.file", std::ios::binary};
cereal::BinaryInputArchive iarchive{is};
iarchive(index);
}

Note that in contrast to the construction via a given text, the template cannot be deduced by the compiler when using the default constructor so you have to provide template arguments.

Assignment 1

You are given the text

dna4_vector text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};

Create a seqan3::fm_index over the reference, store the index and load the index into a new seqan3::fm_index object. Print whether the indices are identical or differ.

Solution

#include <fstream>
#include <cereal/archives/binary.hpp>
using seqan3::operator""_dna4;
int main()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::fm_index index{text};
{
std::ofstream os{"index.file", std::ios::binary};
cereal::BinaryOutputArchive oarchive{os};
oarchive(index);
}
// we need to tell the index that we work on a single text and a `dna4` alphabet before loading
{
std::ifstream is{"index.file", std::ios::binary};
cereal::BinaryInputArchive iarchive{is};
iarchive(index2);
}
if (index == index2)
std::cout << "The indices are identical!\n";
else
std::cout << "The indices differ!\n";
}

Expected output:

The indices are identical!

Search

Using an index, we can now conduct searches for a given query. In this part we will learn how to search exactly, allow substitutions and indels, and how to configure what kind of results we want, e.g. all results vs. only the best result.

Terminology

Hit

If a query can be found in the reference with the specified error configuration, we denote this result as a hit. If the query is found in the reference sequence without any errors, we call this an exact hit, otherwise, if it contains errors, it's an approximate hit.

A hit is stored in a seqan3::search_result

Template Parameters
query_id_typeThe type of the query_id; must model std::integral.
cursor_typeThe type of the cursor; must model seqan3::fm_index_cursor_specialisation.
reference_id_typeThe type of the reference_id; must model std::integral.
reference_begin_position_typeThe type of the reference_begin_position; must model std::integral.

The seqan3::search algorithm returns a range of hits. A single hit is stored in a seqan3::search_result. By default, the search result contains the query id, the reference id where the query matched and the begin position in the reference where the query sequence starts to match the reference sequence. Those information can be accessed via the respective member functions.

The following member functions exist:

Note that the index cursor is not included in a hit by default. If you are trying to use the respective member function, a static_assert will prevent you from doing so. You can configure the result of the search with the output configuration.

Searching for exact hits

We can search for all exact hits using seqan3::search:

#include <seqan3/core/debug_stream.hpp> // pretty printing
using namespace std::string_literals; // for using the ""s string literal
int main()
{
std::string text{"Garfield the fat cat without a hat."};
seqan3::fm_index index{text};
seqan3::debug_stream << search("cat"s, index) << '\n'; // [<query_id:0, reference_id:0, reference_pos:17>]
}

You can also pass multiple queries at the same time:

std::string text{"Garfield the fat cat without a hat."};
seqan3::fm_index index{text};
std::vector<std::string> query{"cat"s, "hat"s};
seqan3::debug_stream << search(query, index) << '\n';
// prints: [<query_id:0, reference_id:0, reference_pos:17>,
// <query_id:1, reference_id:0, reference_pos:31>]

The returned result will be a vector of individual results. Each element refers to the respective query.

Assignment 2

Search for all exact occurrences of GCT in the text from assignment 1.
Print the number of hits and the positions in an ascending ordering.
Do the same for the following text collection:

std::vector<dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
"ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
"GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};

Solution

using seqan3::operator""_dna4;
void run_text_single()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::fm_index index{text};
seqan3::debug_stream << "===== Running on a single text =====\n"
<< "The following hits were found:\n";
for (auto && result : search("GCT"_dna4, index))
}
void run_text_collection()
{
std::vector<seqan3::dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
"ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
"GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::fm_index index{text};
seqan3::debug_stream << "===== Running on a text collection =====\n"
<< "The following hits were found:\n";
for (auto && result : search("GCT"_dna4, index))
}
int main()
{
run_text_single();
run_text_collection();
}

Expected output:

===== Running on a single text =====
There are 3 hits.
The following hits were found:
<query_id:0, reference_id:0, reference_pos:1>
<query_id:0, reference_id:0, reference_pos:41>
<query_id:0, reference_id:0, reference_pos:77>
===== Running on a text collection =====
There are 3 hits.
The following hits were found:
<query_id:0, reference_id:0, reference_pos:1>
<query_id:0, reference_id:1, reference_pos:9>
<query_id:0, reference_id:2, reference_pos:16>

Searching for approximate hits

If you want to allow for errors in your query, you need to configure the approximate search with a seqan3::search_cfg::max_error_total, seqan3::search_cfg::max_error_substitution, seqan3::search_cfg::max_error_insertion or seqan3::search_cfg::max_error_deletion object. These are constructed with absolute numbers or rates:

When you want to use more than one error type, append it to the error configuration by using the |-operator.

Note
When using >= 2 errors it is advisable to use a Bi-FM-Index since searches will be faster.

To search for either 1 insertion or 1 deletion you can write:

std::string text{"Garfield the fat cat without a hat."};
seqan3::fm_index index{text};
seqan3::debug_stream << search("cat"s, index, cfg) << '\n';
// prints: [<query_id:0, reference_id:0, reference_pos:14>,
// <query_id:0, reference_id:0, reference_pos:17>,
// <query_id:0, reference_id:0, reference_pos:18>,
// <query_id:0, reference_id:0, reference_pos:32>]

Note that the error amount/rates do not have to sum up to the total of errors allowed:

Allows 2 errors which can be any combination of 2 substituions, 1 insertion and 1 deletion.

Defining only the total will set all error types to this value.

Defining any of the error types, but no total, will set the total to the sum of all error types; if a total is set, it will act as an upper bound.

Assignment 3

Search for all occurrences of GCT in the text from assignment 1.

Allow up to 1 substitution and print the matching substring of the reference.

Hint The search will give you positions in the text. To access the corresponding subrange of the text you can use std::span:

#include <seqan3/std/span>
int main()
{
std::string text{"Garfield the fat cat without a hat."};
size_t start{2};
size_t span{3};
std::span text_view{std::data(text) + start, span}; // represent interval [2, 4]
seqan3::debug_stream << text_view << '\n'; // Prints "rfi"
}

Solution

#include <seqan3/std/span>
using seqan3::operator""_dna4;
int main()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::fm_index index{text};
for (auto && result : search("GCT"_dna4, index, cfg))
{
seqan3::debug_stream << "At position " << result.reference_begin_position() << ": "
<< std::span{std::data(text) + result.reference_begin_position(), 3}
<< '\n';
}
}

Expected output:

There are 12 hits.
At position 1: GCT
At position 5: TCT
At position 12: GAT
At position 23: GCC
At position 36: GAT
At position 41: GCT
At position 57: ACT
At position 62: GCC
At position 75: GCG
At position 77: GCT
At position 83: GCA
At position 85: ACT

Which hits are reported?

Besides the error configuration, you can define which hits should be reported:

Any hit configuration element is appended to the error configuration by using the |-operator:

If you want to choose the hit strategy at runtime you can also use the dynamic configuration seqan3::search_cfg::hit.

The seqan3::search_cfg::strata configuration element needs an additional parameter:

If the best hit had an edit distance of 1, the strata strategy would report all hits with up to an edit distance of 3. Since in this example the total error number is set to 2, all hits with 1 or 2 errors would be reported.

Assignment 4

Search for all occurrences of GCT in the text from assignment 1.
Allow up to 1 error of any type and print the number of hits for each hit strategy (use seqan3::search_cfg::strata{1}).

Solution

using seqan3::operator""_dna4;
int main()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::dna4_vector query{"GCT"_dna4};
seqan3::fm_index index{text};
seqan3::debug_stream << "Searching all hits\n";
auto results_all = search(query, index, cfg_all);
// Attention: results_all is a pure std::ranges::input_range,
// so after calling std::ranges::distance, you cannot iterate over it again!
seqan3::debug_stream << "There are " << std::ranges::distance(results_all) << " hits.\n";
seqan3::debug_stream << "Searching all best hits\n";
auto results_all_best = search(query, index, cfg_all_best);
// Attention: results_all_best is a pure std::ranges::input_range,
// so after calling std::ranges::distance, you cannot iterate over it again!
seqan3::debug_stream << "There are " << std::ranges::distance(results_all_best) << " hits.\n";
seqan3::debug_stream << "Searching best hit\n";
auto results_best = search(query, index, cfg_best);
// Attention: results_best is a pure std::ranges::input_range,
// so after calling std::ranges::distance, you cannot iterate over it again!
seqan3::debug_stream << "There is " << std::ranges::distance(results_best) << " hits.\n";
seqan3::debug_stream << "Searching all hits in the 1-stratum\n";
auto results_strata = search(query, index, cfg_strata);
// Attention: results_strata is a pure std::ranges::input_range,
// so after calling std::ranges::distance, you cannot iterate over it again!
seqan3::debug_stream << "There are " << std::ranges::distance(results_strata) << " hits.\n";
}

Expected output:

Searching all hits
There are 25 hits.
Searching all best hits
There are 3 hits.
Searching best hit
There is 1 hit.
Searching all hits in the 1-stratum
There are 25 hits.

One last exercise

Assignment 5

Search for all occurrences of GCT in the text from assignment 1.
Allow up to 1 error of any type and search for all occurrences with the strategy seqan3::search_cfg::hit_all_best.
Align the query to each of the found positions in the genome and print the score and alignment.
BONUS
Do the same for the text collection from assignment 2.
Hint Similarly to assignment 3, you have to extract the subrange of the text in order to align against it:

// std::span returnes the subrange [start, start + span - 1], i.e. span many elements beginning at start
// For single texts.
std::span text_view{std::data(text) + start, span};
// For text collections. idx is the current index of the text collection returned by the search.
std::span text_view{std::data(text[idx]) + start, span};

Solution

#include <seqan3/std/span>
using seqan3::operator""_dna4;
void run_text_single()
{
seqan3::dna4_vector
text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTAACCCGATGAGCTACCCAGTAGTCGAACTGGGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::dna4_vector query{"GCT"_dna4};
seqan3::fm_index index{text};
seqan3::debug_stream << "Searching all best hits allowing for 1 error in a single text\n";
seqan3::align_cfg::aligned_ends{seqan3::free_ends_first} |
seqan3::align_cfg::result{seqan3::with_alignment};
auto search_results = search(query, index, search_config);
seqan3::debug_stream << "-----------------\n";
for (auto && hit : search_results)
{
size_t start = hit.reference_begin_position() ? hit.reference_begin_position() - 1 : 0;
std::span text_view{std::data(text) + start, query.size() + 1};
for (auto && res : align_pairwise(std::tie(text_view, query), align_config))
{
auto && [aligned_database, aligned_query] = res.alignment();
seqan3::debug_stream << "score: " << res.score() << '\n';
seqan3::debug_stream << "database: " << aligned_database << '\n';
seqan3::debug_stream << "query: " << aligned_query << '\n';
seqan3::debug_stream << "=============\n";
}
}
}
void run_text_collection()
{
std::vector<seqan3::dna4_vector> text{"CGCTGTCTGAAGGATGAGTGTCAGCCAGTGTA"_dna4,
"ACCCGATGAGCTACCCAGTAGTCGAACTG"_dna4,
"GGCCAGACAACCCGGCGCTAATGCACTCA"_dna4};
seqan3::dna4_vector query{"GCT"_dna4};
seqan3::fm_index index{text};
seqan3::debug_stream << "Searching all best hits allowing for 1 error in a text collection\n";
seqan3::align_cfg::aligned_ends{seqan3::free_ends_first} |
seqan3::align_cfg::result{seqan3::with_alignment};
seqan3::debug_stream << "-----------------\n";
for (auto && hit : search(query, index, search_config))
{
size_t start = hit.reference_begin_position() ? hit.reference_begin_position() - 1 : 0;
std::span text_view{std::data(text[hit.reference_id()]) + start, query.size() + 1};
for (auto && res : align_pairwise(std::tie(text_view, query), align_config))
{
auto && [aligned_database, aligned_query] = res.alignment();
seqan3::debug_stream << "score: " << res.score() << '\n';
seqan3::debug_stream << "database: " << aligned_database << '\n';
seqan3::debug_stream << "query: " << aligned_query << '\n';
seqan3::debug_stream << "=============\n";
}
}
}
int main()
{
run_text_single();
run_text_collection();
}

Expected output:

Searching all best hits allowing for 1 error in a single text
There are 3 hits.
-----------------
score: 0
database: GCT
query: GCT
=============
score: 0
database: GCT
query: GCT
=============
score: 0
database: GCT
query: GCT
=============
Searching all best hits allowing for 1 error in a text collection
There are 3 hits.
-----------------
score: 0
database: GCT
query: GCT
=============
score: 0
database: GCT
query: GCT
=============
score: 0
database: GCT
query: GCT
=============

debug_stream.hpp
Provides seqan3::debug_stream and related types.
dna4.hpp
Provides seqan3::dna4, container aliases and string literals.
span
Provides std::span from the C++20 standard library.
fstream
std::string
align_pairwise.hpp
Provides pairwise alignment function.
seqan3::fm_index
The SeqAn FM Index.
Definition: fm_index.hpp:193
seqan3::search_cfg::max_error_substitution
Configuration element that represents the number or rate of substitution errors.
Definition: max_error.hpp:53
std::vector< std::string >
seqan3::search_cfg::max_error_total
Configuration element that represents the number or rate of total errors.
Definition: max_error.hpp:35
std::search
T search(T... args)
seqan3::align_pairwise
constexpr auto align_pairwise(sequence_t &&seq, alignment_config_t const &config)
Computes the pairwise alignment for a pair of sequences or a range over sequence pairs.
Definition: align_pairwise.hpp:139
seqan3::align_cfg::result
Sets the result of the alignment computation.
Definition: align_config_result.hpp:136
seqan3::search_cfg::max_error_deletion
Configuration element that represents the number or rate of deletion errors.
Definition: max_error.hpp:92
seqan3::search_cfg::max_error_insertion
Configuration element that represents the number or rate of insertion errors.
Definition: max_error.hpp:73
bi_fm_index.hpp
Provides the bidirectional seqan3::bi_fm_index.
seqan3::bi_fm_index
The SeqAn Bidirectional FM Index.
Definition: bi_fm_index.hpp:65
std::tie
T tie(T... args)
std::cout
seqan3::configuration
Collection of elements to configure an algorithm.
Definition: configuration.hpp:81
seqan3::search_cfg::error_count
A strong type of underlying type uint8_t that represents the number of errors.
Definition: max_error_common.hpp:27
std::ofstream
search.hpp
Provides the public interface for search algorithms.
seqan3::align_cfg::aligned_ends
The configuration for aligned sequence ends.
Definition: align_config_aligned_ends.hpp:548
seqan3::search_cfg::hit_strata
Configuration element to receive all hits with the best number of errors plus the given stratum....
Definition: hit.hpp:89
all.hpp
Meta-header for the alignment configuration module .
seqan3::align_cfg::result::result
result(alignment_result_tag_t) -> result< alignment_result_tag_t >
Deduces the alignment result from the given constructor argument.
seqan3::debug_stream
debug_stream_type debug_stream
A global instance of seqan3::debug_stream_type.
Definition: debug_stream.hpp:42
seqan3::search
auto search(queries_t &&queries, index_t const &index, configuration_t const &cfg=search_cfg::default_configuration)
Search a query or a range of queries in an index.
Definition: search.hpp:104
seqan3::align_cfg::edit
constexpr configuration edit
Shortcut for edit distance configuration.
Definition: align_config_edit.hpp:52
seqan3::search_cfg::hit_single_best
constexpr detail::hit_single_best_tag hit_single_best
Configuration element to receive a single best hit with the lowest number of errors within the error ...
Definition: hit.hpp:82
fm_index.hpp
Provides the unidirectional seqan3::fm_index.
std::data
T data(T... args)
seqan3::search_cfg::hit_all
constexpr detail::hit_all_tag hit_all
Configuration element to receive all hits within the error bounds.
Definition: hit.hpp:68
seqan3::search_cfg::hit_all_best
constexpr detail::hit_all_best_tag hit_all_best
Configuration element to receive all hits with the lowest number of errors within the error bounds.
Definition: hit.hpp:75
std::ifstream