How to align sequences and managing alignments.
1 Overview
The alignment of nucleotide, RNA or amino acid sequences is one of the most common tasks in Bioinformatics. The motivation for doing alignments is that high sequence similarity usually implies significant functional or structural similarity. The simultaneous alignment of three or more sequences is an essential precondition to phylogenetic analyses or studies about protein families. SeqAn provides two kinds of alignment data structures, a matrix-like alignment data structure and a graph-based alignment representation.
2 Computing Alignments
Basic Alignment Algorithms
Although there is a multitude of different alignment algorithms the basic syntax for performing an alignment is always the same:
TScore result = globalAlignment(TSequences, TScore, TAlgorithm)
TScore result = localAlignment(TSequences, TScore, TAlgorithm)
TSequences, TScore, and TAlgorithm are just placeholders explained below. An actual call could look like this:
int score = globalAlignment(alignment_ds, Score<int>(1,-1,-1,-2), Gotoh());
In this case the sequences are stored in alignment_ds. The chosen Gotoh algorithm uses affine gap costs and it is configured with a scoring scheme, where matches are scored +1, mismatches -1, a gap-opening -2, and a gap-extension -1. The globalAlignment call returns the score of the alignment and stores the actual alignment in alignment_ds, which could be an Alignment Graph or an Align data structure. If it is an alignment graph, its textual representation in PipMaker format is illustrated in the following figure.
If one is only interested in the score of an alignment, a StringSet that holds the sequences is sufficient. For instance, in the above example
int score = globalAlignment(string_set, Score<int>(1,-1,-1,-2), Gotoh());
would do the job. To sum it up, TSequences must be either an Align or an Alignment Graph containing the sequences, or a simple StringSet. Note that there are also other possibilities, e.g., to retrieve just the matches of an alignment or to print the traceback in the console.
int score = globalAlignment(std::cout, string_set, Score<int>(1,-1,-1,-2), Gotoh());
TScore is a scoring object. The most common scoring objects are illustrated below in conjunction with different algorithms:
Score<int> score_type = Score<int>(0,-1,-1,0);
int score0 = globalAlignment(alignment_graph, score_type, NeedlemanWunsch() );
Blosum62 score_type_blossum(-1,-11);
int score1 = globalAlignment(alignment_graph, score_type_blossum, Gotoh() );
Score<int, Pam<> > pam(250, -1, 0);
int score2 = globalAlignment(alignment_graph, pam, Hirschberg() );
Note that certain algorithms require different score types, e.g., for an affine scoring scheme Gotoh and Hirschberg work well but Needleman-Wunsch fails because it uses linear gap penalties.
By now you have seen the most common possibilities for TScore and TAlgorithm. For local alignments TAlgorithm can be substituted with SmithWaterman(), which implements the well-known Smith-Waterman algorithm.
You can find more example code about computing global and local alignments here: Global Alignments, Local Alignments.
Alignment Configurations
However, in some situations more sophisticated alignment algorithms are necessary. For example, in assembly projects one is usually not interested in ordinary global alignments but so-called overlap alignments. These require a special initialization of the dynamic programming matrix. Also semi-global alignments that try to fit one sequence into the other require adapted dynamic programming matrices. That's why graph-based alignment algorithms can be configured with an AlignConfig object that has 4 boolean parameters.
AlignConfig<TTop, TLeft, TRight, TBottom>
The parameters indicate whether the first row and / or column is initialized with 0's and whether the maximum is searched in the last row and / or column. So in total there are 2 to the power of 4 = 16 different configurations. The following code snippet initializes the first row with 0's and looks for the maximum in the last row (classical sequence fitting).
typedef String<char> TString;
typedef StringSet<TString, Dependent<> > TStringSet;
typedef Graph<Alignment<TStringSet, void> > TGraph;
TStringSet str;
TString str0("TarfieldandGarfieldarestupid.");appendValue(str, str0);
TString str1("Garfield");appendValue(str, str1);
Score<int> score_type = Score<int>(2,-1,-1,-4);
TGraph g(str);
AlignConfig<true,false,false,true> ac;
int score = globalAlignment(g, score_type, ac, Gotoh() );
Multiple sequence alignments are part of SeqAn and currently under development. If needed contact the SeqAn team.
SeqAn - Sequence Analysis Library -