Sequences
Sequences in SeqAn.
Sequences are the core of SeqAn. This tutorial gives an overview of the variety of sequence types in SeqAn and of the basic algorithms for manipulating them.
1 Sequence Classes
1.1 Strings
Strings are containers that store sequences of values, for example of char, nucleotides, or amino acids. Users of SeqAn can choose among several kinds of strings (see below for the string implementations SeqAn offers) and among different value types (i.e. alphabets). Any value type can be combined with any kind of string. Typically, the value type is specified as the first template argument, i.e. in angle brackets:
String<AminoAcid> myProteine;
SeqAn predefines shortcuts for some common value types, so we can also write:
Peptide myProteine;
SeqAn offers many functions and operators for initializing, converting, manipulating, and printing strings. For example:
String<char> str = "this is ";
str += "a test.";
cout << str;                    //output: "this is a test."
cout << length(str);            //output: 15
For more examples see this demo program.
The user can specify the kind of string to be used in an optional second template argument of String. The following kinds of string are available:
Specialization | Description | Applications | Limitations
Alloc String | Expandable string that is stored on the heap. | The default string implementation that can be used for general purposes. | Changing the capacity can be very costly since all values must be copied.
Array String | Fast but non-expandable string. | Fast storing of small sequences. | Capacity must already be known at compile time. Not suitable for storing large sequences.
Block String | String that stores its content in blocks. | The capacity of the string can quickly be increased. Good choice for growing strings and for stacks. | Iteration and random access to values are slightly slower than for Alloc Strings.
Packed String | A string that stores as many values in one machine word as possible. | Suitable for storing large strings in memory. | Slower than other in-memory strings.
External String | String that is stored in secondary memory. | Suitable for storing very large strings (>2GB). Parts of the string are automatically loaded from secondary memory when needed. | Slower than other string classes.
CStyle String | Allows adaption of strings to C-style strings. | Used for transforming other String classes into C-style strings (i.e. null-terminated char arrays). Could be useful for calling functions of C libraries. | Only reasonable if the value type is char or wchar_t.
For example:
String<char, Array<100> > myArrayString;    //string with maximum length 100
String<Dna, Packed<> > myPackedString;      //string that takes only 2 bits per nucleotide
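To make the "2 bits per nucleotide" remark concrete, the following is a minimal sketch of the packing idea in standard C++. It is not SeqAn's Packed String implementation: the class PackedDna, the function pack, the constant kValuesPerWord, and the choice of 64-bit words are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumption for this sketch: 64-bit words, 2 bits per value.
constexpr std::size_t kValuesPerWord = 32;

// Hypothetical illustration type; this is not SeqAn's Packed String.
class PackedDna {
public:
    // Append one nucleotide given as 'A', 'C', 'G', or 'T'.
    void push_back(char base) {
        const std::size_t word = size_ / kValuesPerWord;
        const std::size_t shift = 2 * (size_ % kValuesPerWord);
        if (word == words_.size()) words_.push_back(0);  // start a new word
        words_[word] |= encode(base) << shift;
        ++size_;
    }

    // Read the nucleotide at position i by shifting and masking.
    char operator[](std::size_t i) const {
        assert(i < size_);
        const std::uint64_t word = words_[i / kValuesPerWord];
        const std::uint64_t code = (word >> (2 * (i % kValuesPerWord))) & 3u;
        return "ACGT"[code];
    }

    std::size_t size() const { return size_; }
    std::size_t wordCount() const { return words_.size(); }

private:
    // Map a nucleotide character to its 2-bit code.
    static std::uint64_t encode(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
        }
        assert(false && "not a nucleotide");
        return 0;
    }

    std::vector<std::uint64_t> words_;
    std::size_t size_ = 0;
};

// Build a packed sequence from plain text.
PackedDna pack(const std::string& text) {
    PackedDna packed;
    for (char base : text) packed.push_back(base);
    return packed;
}
```

In this sketch, pack("ACGTTGCA") occupies a single word, and a 33rd value starts a second one. Every access pays for a shift and a mask, which is in line with the limitation listed for packed strings above.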
1.2 Sequence Adaptions
SeqAn offers an interface for accessing standard library strings and C-style char arrays. Hence those built-in types can be handled in the same way as SeqAn strings. For example:
std::basic_string<char> str1 = "a standard library string";
cout << length(str1);           //output: 25

char str2[] = "this is a char array";
cout << length(str2);           //output: 20
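One way such an adaption can work is through overloaded free functions, so that generic code never calls a member function of the sequence directly. The sketch below shows the idea in standard C++ only; seqLength and isEmptySeq are hypothetical names chosen to avoid confusion with SeqAn's own length, and nothing here is taken from SeqAn's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper; overload for standard library strings.
template <typename TChar>
std::size_t seqLength(const std::basic_string<TChar>& s) {
    return s.size();
}

// Hypothetical helper; overload for null-terminated char arrays of extent N.
template <std::size_t N>
std::size_t seqLength(const char (&s)[N]) {
    assert(std::memchr(s, '\0', N) != nullptr);  // the array must be terminated
    return std::strlen(s);
}

// A generic algorithm written once against the free-function interface.
template <typename TSeq>
bool isEmptySeq(const TSeq& s) {
    return seqLength(s) == 0;
}
```

With these overloads, isEmptySeq accepts both a std::string and a char array, because overload resolution picks the matching seqLength for each type.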
1.3 Segments
Segments are sequences that represent parts of other sequences. There are three kinds of segments in SeqAn: infixes, prefixes, and suffixes. The metafunctions Infix, Prefix, and Suffix, respectively, return for a given sequence an appropriate data type for storing the segment. For example:
Peptide prot = "AAADDDEEE";
Suffix<Peptide>::Type suf = suffix(prot, 3);
cout << suf;                                    //output: "DDDEEE"
The segment does not create a copy of the sequence. Changing the segment also changes its host sequence, for example:
CharString str = "start_middle_end";
infix(str, 6, 12) = "overwrite";                //overwrites "middle"
cout << str;                                    //output: "start_overwrite_end"
If this effect is undesirable, one has to explicitly make a copy of the string.
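The write-through behaviour of a segment can be pictured as a small object that stores a pointer to its host plus two positions. The following standard C++ sketch is an illustration under that assumption; InfixView, infixOf, and copy are hypothetical names, and SeqAn's Segment class is not implemented this way line for line.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical illustration type; a non-owning view of host[begin, end).
struct InfixView {
    std::string* host;
    std::size_t begin;
    std::size_t end;

    // Assigning to the view replaces the viewed range inside the host.
    InfixView& operator=(const std::string& text) {
        host->replace(begin, end - begin, text);
        end = begin + text.size();  // keep the view on the new content
        return *this;
    }

    // Explicit copy: the result no longer depends on the host.
    std::string copy() const { return host->substr(begin, end - begin); }
};

// Create a view of host[begin, end) without copying any values.
InfixView infixOf(std::string& host, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= host.size());
    return InfixView{&host, begin, end};
}
```

Calling infixOf(str, 6, 12) = "overwrite" on a host holding "start_middle_end" changes the host itself, whereas a string obtained from copy() beforehand keeps the old content "middle".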
2 Working with Sequences
2.1 Iterators
Iterators are objects that are used to scan through containers like strings or segments. For a given container class the metafunction Iterator returns an appropriate iterator. An iterator always points to one value in the container. The function value (which does the same as the operator *) can be used to access this value. Functions like goNext or goPrevious (which do the same as ++ and --, respectively) can be used to move the iterator to other values within the container.
The functions begin and end, applied to a container, return iterators to the beginning and the end of the container. Note that, as with C++ standard library iterators, the iterator returned by end does not point to the last value of the container but to the position just behind it. So if s is empty, then end(s) == begin(s).
The following code that prints out a sequence demonstrates a typical iteration through a string:
String<char> str = "acgt";
typedef Iterator<String<char> >::Type TIterator;
for (TIterator it = begin(str); it != end(str); ++it)
{
    cout << value(it);
}
See this demo program for more examples.
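The equivalence between the named functions and the operators can be sketched for standard library iterators as follows. The namespace sketch and the helpers scan and lastValue are hypothetical; the functions value, goNext, and goPrevious only mirror the names used in the text and are not SeqAn's implementations.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Named counterpart of operator *.
template <typename TIter>
auto value(TIter it) -> decltype(*it) {
    return *it;
}

// Named counterpart of operator ++.
template <typename TIter>
void goNext(TIter& it) {
    ++it;
}

// Named counterpart of operator --.
template <typename TIter>
void goPrevious(TIter& it) {
    --it;
}

// Collect every value from begin to end using only the named functions.
std::string scan(const std::string& s) {
    std::string seen;
    for (auto it = s.begin(); it != s.end(); goNext(it)) {
        seen += value(it);
    }
    return seen;
}

// end points just behind the last value, so step back once to reach it.
char lastValue(const std::string& s) {
    assert(!s.empty());
    auto it = s.end();
    goPrevious(it);
    return value(it);
}

}  // namespace sketch
```

For an empty string the loop in scan does not execute at all, because begin and end compare equal.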
2.2 Comparisons
Two sequences can be lexicographically compared using usual operators like < or >=, for example:
String<char> a      = "beta";
String<char> b      = "alpha";

bool a_not_equals_b = (a != b);     //true
bool a_less_b       = (a < b);      //false
Each comparison involves a scan of the two sequences to find the first mismatch between them. This can be costly if the two sequences share a long common prefix. Suppose we want to branch in a program depending on whether a < b, a == b, or a > b, for example:
if (a < b)      { /* code for case "a < b"  */ }
else if (a > b) { /* code for case "a > b"  */ }
else            { /* code for case "a == b" */ }
In this case, a single scan would be enough to decide which case applies, yet each of the operators > and < performs a new comparison. SeqAn offers lexicals to avoid unnecessary sequence scans. A lexical stores the result of a comparison, for example:
Lexical<> comp(a, b);   //compare a and b and store the result in comp

if (isLess(comp))           { /* code for case "a < b"  */ }
else if (isGreater(comp))   { /* code for case "a > b"  */ }
else                        { /* code for case "a == b" */ }
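The idea of scanning once and querying the stored outcome several times can be sketched in standard C++ as shown below. StoredComparison and describe are hypothetical names, and the sketch makes no claim about how SeqAn's Lexical is implemented; it only shows that one pass up to the first mismatch is enough to answer all three questions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical illustration type; this is not SeqAn's Lexical class.
class StoredComparison {
public:
    // Scan both sequences once, up to the first mismatch, and keep the outcome.
    StoredComparison(const std::string& a, const std::string& b) : result_(0) {
        const std::size_t shared = std::min(a.size(), b.size());
        const auto stop = a.begin() + static_cast<std::ptrdiff_t>(shared);
        const auto diff = std::mismatch(a.begin(), stop, b.begin());
        if (diff.first != stop) {
            // A mismatch inside the common length decides the order.
            const unsigned char x = static_cast<unsigned char>(*diff.first);
            const unsigned char y = static_cast<unsigned char>(*diff.second);
            result_ = (x < y) ? -1 : 1;
        } else if (a.size() != b.size()) {
            // One sequence is a prefix of the other: the shorter one is less.
            result_ = (a.size() < b.size()) ? -1 : 1;
        }
    }

    bool isLess() const { return result_ < 0; }
    bool isEqual() const { return result_ == 0; }
    bool isGreater() const { return result_ > 0; }

private:
    int result_;  // -1, 0, or 1
};

// Branch three ways after a single scan.
const char* describe(const std::string& a, const std::string& b) {
    const StoredComparison comp(a, b);
    if (comp.isLess()) return "a < b";
    if (comp.isGreater()) return "a > b";
    assert(comp.isEqual());
    return "a == b";
}
```

The queries isLess, isEqual, and isGreater only inspect the stored integer, so calling all of them costs no further scan of the sequences.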
2.3 Expansion
Each sequence object has a capacity, i.e. the maximum length of a sequence that can be stored in this object. While some sequence types like Array String or char array have a fixed capacity, the capacity of other sequence classes like Alloc String or std::basic_string can be changed at runtime. Capacity can either be set explicitly by functions like reserve or resize, or implicitly changed if a function like append or replace has a result that would be too long for the target string. There are several overflow strategies that determine what actually happens when a string should be expanded beyond its capacity. If no overflow strategy is specified for a function call, a default overflow strategy is selected depending on the type of the sequence.
Example:
String<char> str;
resize(str, 5, Exact());                    //sets capacity of str to 5
assign(str, "abcdefghijklmn", Limit());     //only "abcde" is assigned to str, since str is limited to 5
append(str, "ABCDEFG");                     //Use default expansion strategy: now str == "abcdeABCDEFG"
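Selecting an overflow strategy through an extra argument is an instance of tag dispatch: an empty struct picks the overload at compile time. The standard C++ sketch below shows the mechanism only. Buffer, assignTo, and appendTo are hypothetical, the tag names merely echo the text, and the growth factor of roughly 1.5 is an arbitrary choice for this sketch rather than SeqAn's documented behaviour.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical strategy tags; empty types used only to select an overload.
struct Limit {};     // never grow: truncate to the current capacity
struct Generous {};  // grow with headroom when more room is needed

// Hypothetical buffer that tracks its capacity explicitly.
struct Buffer {
    explicit Buffer(std::size_t cap) : capacity(cap) {}
    std::string data;
    std::size_t capacity;
};

// Limit strategy: keep only what fits into the existing capacity.
void assignTo(Buffer& target, const std::string& source, Limit) {
    target.data = source.substr(0, std::min(source.size(), target.capacity));
}

// Generous strategy: enlarge the capacity beyond what is strictly needed.
void appendTo(Buffer& target, const std::string& source, Generous) {
    const std::size_t needed = target.data.size() + source.size();
    if (needed > target.capacity) {
        target.capacity = needed + needed / 2;  // assumed growth factor
    }
    target.data += source;
    assert(target.data.size() <= target.capacity);
}
```

With a Buffer of capacity 5, assignTo(buf, "abcdefghijklmn", Limit()) stores only "abcde", and a subsequent appendTo(buf, "ABCDEFG", Generous()) enlarges the capacity and yields "abcdeABCDEFG".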
2.4 Conversion
A sequence of values of type A can be converted into a sequence of values of type B if A can be converted into B. SeqAn offers three different ways of conversion:
1. Copy conversion. The source sequence is copied into the target sequence. This can be done by assignment (operator =) or using the function assign, for example:
String<Dna> source = "acgtgcat";
String<char> target;
assign(target, source);     //copy conversion is done here
2. Move conversion. If the source sequence is no longer needed after the conversion, it is always advisable to use move instead of assign. move does not need to make a copy but can reuse the storage of the source sequence. In some cases, move can also perform an in-place conversion, for example:
String<Dna> source = "acgtgcat";
String<char> target;
move(target, source);       //the in-place move conversion is done here
3. Modifier conversion. Instead of creating an actual target sequence, use a modifier (see the tutorial) to 'emulate' a sequence with a different value type, i.e. the modifier target in the following example behaves exactly like a char sequence:
String<Dna> source = "acgtgcat";
typedef Modifier<String<Dna>, ModView<FunctorConvert<Dna, char> > > TDnaToCharModifier;
TDnaToCharModifier target(source);      //this is a sequence of char that contains "acgtgcat"
Value<TDnaToCharModifier>::Type c;      //defines a variable of type char
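The trade-offs between the three ways of conversion can be sketched in standard C++ as follows. None of this is SeqAn code: DnaCodes, toChar, copyConvert, takeOver, and CharView are hypothetical, the numeric nucleotide codes are an assumption, and the move shown here only transfers storage between sequences of the same type, which is a simpler case than the in-place conversion described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Assumed nucleotide codes for this sketch: 0=A, 1=C, 2=G, 3=T.
using DnaCodes = std::vector<std::uint8_t>;

// Convert a single code into its character.
char toChar(std::uint8_t code) {
    assert(code < 4);
    return "ACGT"[code];
}

// 1. Copy conversion: every value is converted into a new target sequence.
std::string copyConvert(const DnaCodes& source) {
    std::string target;
    target.reserve(source.size());
    for (std::uint8_t code : source) target += toChar(code);
    return target;
}

// 2. Move: the target takes over the storage of the source, no values copied.
DnaCodes takeOver(DnaCodes& source) {
    return std::move(source);
}

// 3. View conversion: no target sequence is built; values convert on access.
class CharView {
public:
    explicit CharView(const DnaCodes& host) : host_(&host) {}
    char operator[](std::size_t i) const { return toChar((*host_)[i]); }
    std::size_t size() const { return host_->size(); }

private:
    const DnaCodes* host_;  // non-owning; the host must outlive the view
};
```

The copy costs time and memory proportional to the sequence length, the move is constant-time but gives up the source, and the view costs nothing up front while converting a value on every access and reflecting later changes to its host.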
2.5 Input/Output
SeqAn offers several ways of loading and saving sequences in many formats. For more information, see here.
3 Gapped Sequences
SeqAn contains a special class Gaps for storing sequences that contain gaps, e.g. lines in sequence alignments. See here for more information.
SeqAn - Sequence Analysis Library - www.seqan.de