Searching in SeqAn.
This tutorial is about searching a known string (the needle) in another string (the haystack). The searching can be exact or inexact. SeqAn also offers methods for searching multiple needles at once.
In SeqAn, all searching is done by calling the function find, which takes at least two arguments:
Some variants of find support further arguments.
Each call of find finds only one occurrence of the needle within the haystack. The Finder can be asked for the position of the last found occurrence. Subsequent calls of find can be used to find more occurrences of the needle, until no more occurrences can be found and find returns
Search algorithms like horspool or shift-AND scan the haystack itself in order to find instances of the needle. Many of these algorithms preprocess the needle in some way. In this case, the preprocessing data is stored in the pattern object.
String<char> my_needle = "is";
Finder<String<char> > finder(my_haystack);
Pattern<String<char>, Horspool> pattern(my_needle);
while (find(finder, pattern))
::std::cout << position(finder) << ",";
Other searching methods preprocess the haystack instead of the needle. In this case, the haystack is a index.
Example: See Index Finder.
Exact matching means to find all positions at which the haystack contains the complete needle without errors. SeqAn offers several algorithms for exact searching for different purposes. The following table lists the specializations of Pattern that can be used for exact searching.
|Horspool||A common and rather simple algorithm that uses a bad character rule to speed up the search.
Works very nice in practice for most applications, though the worst case running time is |
|BFAM||The Backward Oracle Matching algorithm that uses a special automaton to scan through the haystack. A good choice for long patterns.|
|BndmAlgo||The Backward Nondeterministic DAWG (Directed Acyclic Word Graph) Matching algorithm uses another special automaton to scan through the haystack. An alternative to BFAM for long patterns.|
|ShiftOr||An algorithm that uses bit parallelism. Should only be used for patterns that are not longer than a machine word, i.e. 32 or 64 characters. Even for small patterns, it is outperformed by Horspool for alphabets larger than Dna.|
|ShiftAnd||An alternative to ShiftOr with similar characteristics.|
The following algorithms can be used to search several needles at once:
|SetHorspool||An extension of the Horspool algorithm that uses a trie to scan backwards through the haystack.|
|AhoCorasick||A multiple pattern search algorithm that uses a trie to scan forwards through the haystack.|
|MultipleShiftAnd||An extension of the ShiftAnd algorithm for searching multiple needles at once. Should only be used for patterns that are not longer than a machine word, i.e. 32 or 64 characters.|
Approximative searching means to find all occurrences of the needle not below a given score limit.
|DPSearch||A dynamic programming algorithm for approximate string-matching. Rather slow but can be applied to many kinds of scoring scheme.|
|MyersUkkonen||A fast bit-parallel approximate string matching algorithm for edit distance.|
The scoring model can be accessed by scoringScheme and setScoringScheme, the limit by scoreLimit and setScoreLimit, and the score of the last found match by getScore. All these functions are applied to the Pattern object that stores all information about the approximative search.
Note: The position stored in the Finder is the end position, not the start position, of the last found match.
Example: See Approximate Searching.
Wildcard searching algorithms make it possible to determine a more complex needle. The used syntax depends on the algorithms used. A typical wildcard pattern could be "
ab+c" that matches to " abc", " abbc", " abbbc", ....
|WildShiftAnd||An extension of ShiftAnd supports a subset of regular expressions, namely "|
Example: See Wildcard Searching.
For more information about searching with indices, see this tutorial.
, , ,
SeqAn - Sequence Analysis Library - www.seqan.de