Learning Objective:
In this tutorial you will learn how to construct an index and conduct searches.
Difficulty | Moderate |
---|---|
Duration | 60 Minutes |
Prerequisite tutorials | Ranges (Recommended) Pairwise Alignment (only for last assignment) |
Recommended reading | FM-Index paper FM-Index on Wikipedia Bi-FM-Index paper |
Exact and approximate string matching is a common problem in bioinformatics, e.g. in read mapping. Usually, we want to search for a match of one or many small sequences (queries) in a database consisting of one or more large sequences (references). Trivial approaches such as on-line search fail at this task for large data due to prohibitive run times.
The general solution is to use a data structure called index. An index is built over the reference and only needs to be computed once for a given data set. It is used to speed up the search in the reference and can be re-used by storing it to disk and loading it again without recomputation.
There are two groups of indices: hash tables and suffix-based indices. SeqAn implements the FM-Index and Bi-FM-Index as suffix-based indices. While the Bi-FM-Index needs more space (a factor of about 2), it allows faster searches.
Given an index, SeqAn will choose the best search strategy for you. Since very different algorithms may be selected internally depending on the configuration, it is advisable to do benchmarks with your application. A rule of thumb is to use the Bi-FM-Index when allowing more than 2 errors.
This tutorial will show you how to use the seqan3::fm_index and seqan3::bi_fm_index to create indices and how to search them efficiently using seqan3::search.
With this module you can:
The results of the search can be passed on to other modules, e.g. to create an alignment.
A reference is the data you want to search in, e.g. a genome or protein database.
A query is the data you want to search for in the reference, e.g. a read or an amino acid sequence.
An index is a data structure built over the reference that allows fast searches.
The full-text minute index (FM-Index) is an index that is similar to a suffix tree, but much smaller. It is used by most state-of-the-art read mappers and aligners. You can find more information on FM-Indicies in the original publication and on Wikipedia.
The bidirectional FM-Index (Bi-FM-Index) is an extension of the FM-Index that enables faster searches, especially when allowing multiple errors. But it uses almost twice the amount of memory the FM-Index uses. You can find more information on Bi-FM-Indicies here.
Constructing a (Bi-)FM-Index is very simple:
You can also index text collections (e.g. genomes with multiple chromosomes or protein databases):
The indices can also be stored and loaded from disk by using cereal.
Note that in contrast to the construction via a given text
, the template cannot be deduced by the compiler when using the default constructor so you have to provide template arguments.
Create a seqan3::fm_index over the reference, store the index and load the index into a new seqan3::fm_index object. Print whether the indices are identical or differ.
Expected output:
Using an index, we can now conduct searches for a given query. In this part we will learn how to search exactly, allow substitutions and indels, and how to configure what kind of results we want, e.g. all results vs. only the best result.
If a query can be found in the reference with the specified error configuration, we denote this result as a match. If the query matches the reference without any errors, we call this an exact match. If a match contains errors, it's an approximate match.
We can search for all exact matches using seqan3::search:
When using a single text, the search will return a std::vector of the begin positions of matches in the reference.
For text collections, a std::vector over std::tuple where the first element represents the index of the text in the collection and the second element represents the position within the text is returned.
You can also pass multiple queries at the same time:
The returned result will be a vector of individual results. Each element refers to the respective query.
GCT
in the text from assignment 1.
Expected output:
If you want to allow for errors in your query, you need to configure the approximate search with a seqan3::search_cfg::max_error or seqan3::search_cfg::max_error_rate object. These are constructed with four parameters:
With absolute numbers (seqan3::search_cfg::max_error):
Use seqan3::search_cfg::max_error_rate with the same elements to define error rates .
To search for either 1 insertion or 1 deletion you can write:
Note that the error amount/rates do not have to sum up to the total of errors allowed:
Allows 2 errors which can be any combination of 2 substituions, 1 insertion and 1 deletion.
Defining only the total will set all error types to this value.
Defining any of the error types, but no total, will set the total to the sum of all error types; if a total is set, it will act as an upper bound.
GCT
in the text from assignment 1.
Allow up to 1 substitution and print the matching substring of the reference.
Expected output:
Besides the error configuration, you can define what kind of hits should be reported:
The mode is appended to the error configuration by using the |
-operator:
The strata mode needs an additional parameter:
If the best hit had an edit distance of 1, the strata mode would report all hits with up to an edit distance of 3. Since in this example the total error number is set to 2, all hits with 1 or 2 errors would be reported.
GCT
in the text from assignment 1.mode::strata{1}
).
Expected output:
GCT
in the text from assignment 1.
Expected output: