Searching in SeqAn.
1 Overview
This tutorial is about searching a known string (the needle) in another string (the haystack). The searching can be exact or inexact. SeqAn also offers methods for searching multiple needles at once.
In SeqAn, all searching is done by calling the function find, which takes at least two arguments:

1: A finder that stores all necessary information about the haystack and the last found position of the needle within the haystack.

2: A pattern that stores all information about the needle.

Some variants of find support further arguments.

Each call of find finds only one occurrence of the needle within the haystack. The Finder can be asked for the position of the last found occurrence. Subsequent calls of find can be used to find more occurrences of the needle, until no more occurrences can be found and find returns false.
Search algorithms like horspool or shift-AND scan the haystack itself in order to find instances of the needle. Many of these algorithms preprocess the needle in some way. In this case, the preprocessing data is stored in the pattern object.


String<char> my_haystack = "Mississippi"
String<char> my_needle = "is"
Finder<String<char> > finder(my_haystack);
Pattern<String<char>, Horspool> pattern(my_needle);
while (find(finder, pattern)) 
    cout << position(finder) << ",";       //output: 1,4,
Other searching methods preprocess the haystack instead of the needle. In this case, the haystack is a index.

Example: See Index Finder.

2 Searching Algorithms
2.1 Exact String Matching
Exact matching means to find all positions at which the haystack contains the complete needle without errors. SeqAn offers several algorithms for exact searching for different purposes. The following table lists the specializations of Pattern that can be used for exact searching.
HorspoolA common and rather simple algorithm that uses a bad character rule to speed up the search. Works very nice in practice for most applications, though the worst case running time is O(n²). Not optimal for small alphabets or very long patterns.
BomAlgoThe Backward Oracle Matching algorithm that uses a special automaton to scan through the haystack. A good choice for long patterns.
BndmAlgoThe Backward Nondeterministic DAWG (Directed Acyclic Word Graph) Matching algorithm uses another special automaton to scan through the haystack. An alternative to BomAlgo for long patterns.
ShiftOrAn algorithm that uses bit parallelism. Should only be used for patterns that are not longer than a machine word, i.e. 32 or 64 characters. Even for small patterns, it is outperformed by Horspool for alphabets larger than Dna.
ShiftAndAn alternative to ShiftOr with similar characteristics.
2.2 Exact Multiple Pattern Matching
The following algorithms can be used to search several needles at once:
SetHorspoolAn extension of the Horspool algorithm that uses a trie to scan backwards through the haystack.
AhoCorasickA multiple pattern search algorithm that uses a trie to scan forwards through the haystack.
MultipleShiftAndAn extension of the ShiftAnd algorithm for searching multiple needles at once. Should only be used for patterns that are not longer than a machine word, i.e. 32 or 64 characters.
2.3 Approximative Searching
Approximative searching means to find all occurrences of the needle not below a given score limit.
DPSearchA dynamic programming algorithm for approximate string-matching. Rather slow but can be applied to many kinds of scoring scheme.
MyersUkkonenA fast bit-parallel approximate string matching algorithm for edit distance.
The scoring model can be accessed by scoringScheme and setScoringScheme, the limit by scoreLimit and setScoreLimit, and the score of the last found match by getScore. All these functions are applied to the Pattern object that stores all information about the approximative search.
Note:The position stored in the Finder is the end position, not the start position, of the last found match.
Example: See Approximate Searching.
2.4 Wildcard Searching
Wildcard searching algorithms make it possible to determine a more complex needle. The used syntax depends on the algorithms used. A typical wildcard pattern could be "ab+c" that matches to "abc", "abbc", "abbbc", ....
WildShiftAndAn extension of ShiftAnd supports a subset of regular expressions, namely "*", "+", "?", braces to specify the number of repeated characters, and character classes that are specified in square brackets.
Example: See Wildcard Searching.
3 Index Searching
For more information about searching with indices, see this tutorial.
SeqAn - Sequence Analysis Library -