Finding motifs in SeqAn.
Motifs are short sequence patterns of biological significance in either DNA, RNA or protein sequences. The discovery of such motifs is an important task in molecular biology. The characterization and localization of motifs is a fundamental approach to a better understanding of the structure, function and evolutionary relationships of the corresponding genes or proteins.
The tutorial will focus on the implemented Motif Finder which was integrated in SeqAn. The Motif Finder, consisting of the class template MotifFinder and the function template findMotif, provides four different motif finding algorithms, the two heuristic algorithms, PROJECTION and ePatternBranching, and the two exact algorithms, PMS1 and PMSP. In the following, the tutorial will shortly explain the structure of MotifFinder and findMotif and give some instructions on how to apply them in order to start a motif search.
The MotifFinder class template holds the parameters used for running the appropriate motif search algorithm, and furthermore serves to store some useful information about each of the motifs discovered by the algorithm, including the pattern of the found motif and its corresponding score which is the number of query sequences containing a motif instance.
MotifFinder has two template parameters represented by
TSeqType and TAlgorithm.
The first template parameter TSeqType determines the type of sequences to be analyzed.
Two possible types of sequences are, for example, Dna and protein sequences.
MotifFinder<AminoAcid, TAlgorithm> finder;
The second template parameter
TAlgorithm specifies the desired motif search algorithm for finding motifs in a set of
sequences. Four different motif finding algorithms are available: Projection,
ePatternBranching, PMS1 and PMSP. Depending on the chosen parameter TAlgorithm
there are different MotifFinder constructors with different arguments.
The Projection Motif Finder denotes the partial specialized MotifFinder with Projection as second template parameter.
The PROJECTION algorithm of Buhler and Tompa is a randomized algorithm which does not guarantee that the unknown motif will be found every time. The chance of success can be increased by performing a large number of independent trials to generate multiple guesses. In each trial, the implemented PROJECTION algorithm makes a preselection of sets of length-
called l-mers which are likely to be a collection of motif instances and refines them by using the EM algorithm of Bailey
and Elkan. In addition to the basic parameters t (number of query sequences), l (length of motif),
m_total (total number of possible l-mers ( m_total = t*(n-l+1), if all sequences have the same sequence length)),
d (number of substitutions) and the boolean parameter is_exact (size of Hamming distance), PROJECTION has the three key
parameters k (projection size), s (bucket threshold) and tr (number of independent trials).
MotifFinder<TSeqType, Projection> finder(t,l,m_total,d,is_exact);
MotifFinder<TSeqType, Projection> finder(t,l,m_total,d,is_exact,k,s,tr);
The Projection Motif Finder provides different constructors as shown above. When creating a Projection Motif Finder object by calling the constructor without the key parameters
k, s and tr
as constructor arguments the key parameters are internally computed.
The ePatternBranching Motif Finder denotes the partial specialized MotifFinder with EPatternBranching as second template parameter.
The ePatternBranching algorithm of Davila and Rajasekaran is an extended version of the well-known heuristic PatternBranching algorithm. This pattern-based algorithm searches in the space of possible motifs. Starting from each
l-mer x in the
input sequences, the algorithm iteratively searches around the vicinities of x and finds the best neighbors by applying
the function bestNeighbors which selects at the end of each step those patterns from the set of best neighbors that
fulfill a particular condition and that are therefore qualified for being a motif instance. In its efficient version,
ePatternBranching first generates all patterns in the Hamming distance h-neighborhood of x. h is an integer value
passed as an input argument to the algorithm and represents the size of the neighborhood which is processed at first.
MotifFinder<TSeqType, EPatternBranching> finder(t,l,d,is_exact,h);
MotifFinder<TSeqType, EPatternBranching> finder(t,l,d,is_exact,h,n_ar);
As shown above, the ePatternBranching Motif Finder provides three different constructors with different arguments. The parameter
h can be passed either directly as constructor argument or it is internally computed.
In the latter case, it is necessary to pass an integer array ( n_ar) for the computation of h containing values which
represent the respective sequence lengths of each input sequence.
The PMS1 Motif Finder denotes the partial specialized MotifFinder with PMS1 as second template parameter.
The PMS1 algorithm developed by Rajasekaran et al. searches in the space of possible motifs such as the ePatternBranching algorithm. The procedure of the PMS1 algorithm is quite simple. For every
l-mer x in each input sequence the algorithm
generates all possible length- l patterns in the Hamming distance d-neighborhood of x. The neighbor sets for each
sequence are then intersected so that at the end of the process we get a group of l-mers or a single l-mer that occurs
in each input sequence with exactly d substitutions.
MotifFinder<TSeqType, PMS1> finder(l,d,is_exact);
The non-default constructor has the three constructor arguments
l (length of motif),
d (number of substitutions) and the boolean parameter is_exact (size of Hamming distance).
The PMSP Motif Finder denotes the partial specialized MotifFinder with PMSP as second template parameter.
The PMSP algorithm of Davila et al. is an improvement of the PMS1 algorithm. It examines each possible
l-mer of the first
input sequence, explores its neighborhood and finally checks whether an l-mer in the neighborhood is a motif instance.
MotifFinder<TSeqType, PMSP> finder(l,d,is_exact);
The non-default constructor of the PMSP Motif Finder has the same three arguments
l, d and is_exact
as the PMS1 Motif Finder.
|A group of DNA or protein sequences (the training set). (Types: StringSet<String<Dna> > (DnaString), StringSet<String<AminoAcid> > (Peptide))|
|A model type for sequence data. (The type of motif distribution to assume.) (Types:|
The function findMotif starts the search for noticeable motif patterns within the sequences in the
It has three input parameters, finder, dataset and sequence_model as shown above, where the object instance finder
of type MotifFinder specifies the algorithm which will be used to solve the motif search problem.
dataset represents the set of sequences which is analyzed for motif patterns that are shared among the
sequences. The sequences are, for example, DNA or protein sequences.
SeqAn provides different types for the representation of nucleotides and amino acids.
Depending on the chosen Motif Finder there are different types for sequence data (
sequence_model) which can be specified
by the user.
|OOPS, OMOPS, ZOOPS, TCM|
|OOPS, OMOPS, ZOOPS, TCM|
The Projection Motif Finder is able to run in OOPS, OMOPS, ZOOPS and TCM mode, while the ePatternBranching Motif Finder only supports the two sequence models, OOPS and OMOPS. The PMS1 Motif Finder and the PMSP Motif Finder, on the other hand, are able to run in all the four modes OOPS, OMOPS, ZOOPS and TCM.
The motif finding process in SeqAn consists of four main steps. Before going into the details of each of the steps, let us first consider the following motif finding example.
Given a set of five DNA sequences, can we find the unknown motif
M of length 10 which
occurs once in each sequence with exactly two substitutions?
Here is the solution of our motif finding example.
The following sections describe each of the four main steps in more detail and show how to use the Seqan's Motif Finder in order to find the unknown motif
M =GGTGTATAAA in the above example.
Step 1: Defining the sample dataset
Before starting a motif search a sample set of sequences (e.g. DNA or protein sequences) must be defined which is believed to have highly conserved sequence motifs at unknown positions in the sample sequences. In the context of DNA sequences, for example, the upstream sequences of co-regulated genes which are controlled by the same regulatory mechanism can form a possible sample set.
The user of SeqAn can specify the input dataset for which the task is to be performed in a number of ways. For short sequences, the easiest way is the explicit definition of the sample sequences which must be all of the same sequence type. For example:
Another possibility is to read the input sequences from a data file. For example:
//are stored in the data file "motif_finding_example.fa".
unsigned int t = 5;
FILE * file1 = fopen("motif_finding_example.fa", "r");
Align<DnaString, ArrayGaps> align;
read(file1, align, FastaAlign());
//Converting Align object into a StringSet object!
for(unsigned int i=0; i<t; ++i)
Step 2: Choosing the appropriate Motif Finder
As mentioned before, SeqAn provides the four Motif Finder types: Projection Motif Finder, ePatternBranching Motif Finder, PMS1 Motif Finder and PMSP Motif Finder. Each of them represents one specific motif finding algorithm of the same name. If the user is interested in examining all possible motifs which occur in the sample sequences, it is advisable to choose the PMS1 or the PMSP Motif Finder, because the other two Motif Finder types, Projection Motif Finder and ePatternBranching Motif Finder, are both approximation algorithms.
|Projection||The Projection algorithm is able to handle both short and longer motifs within a reasonable time.|
|ePatternBranching||The ePatternBranching algorithm works with good experimental success rates and does not
show any considerable loss of accuracy when comparing its performance results with those
of the exact algorithms PMSP and PMS1. However, the algorithm has
a prohibitive run time,
especially for the challenging problems and for those motif finding problems having a set of
|PMS1||The PMS1 algorithm has a run time similar to PMSP and a lower run time than ePatternBranching.
It works well for values of |
|PMSP||The PMSP algorithm is more space efficient than PMS1 and can solve the motif problems by making use of
significantly less memory. But its running time significantly increases with the size of the motif's length
Step 3: Performing the motif search using function findMotif
In order to begin the motif search, the function template findMotif must be applied to the input dataset. Since instances of the variant motif are planted exactly once in each sample sequence, we set the third input parameter
sequence_model to OOPS.
Step 4: Handling results obtained from the motif finding algorithm
SeqAn offers the function displayResult which is used to display all motif candidates found by the appropriate algorithm. The function only has one parameter which must be of type MotifFinder.
For more examples see this demo program.
SeqAn - Sequence Analysis Library - www.seqan.de