Motif Search
Finding motifs in SeqAn.
Motifs are short sequence patterns of biological significance in either DNA, RNA or protein sequences. The discovery of such motifs is an important task in molecular biology. The characterization and localization of motifs is a fundamental approach to a better understanding of the structure, function and evolutionary relationships of the corresponding genes or proteins.
The Motif Finder, consisting of the class MotifFinder and the function findMotif, provides four different motif finding algorithms, the two heuristic algorithms, PROJECTION and ePatternBranching, and the two exact algorithms, PMS1 and PMSP. In the following, the tutorial will shortly explain the structure of MotifFinder and findMotif and give some instructions on how to apply them in order to start a motif search.
1 MotifFinder class
The MotifFinder class template holds the parameters used for running the appropriate motif search algorithm, and furthermore serves to store some useful information about each of the motifs discovered by the algorithm, including the pattern of the found motif and its corresponding score which is the number of query sequences containing a motif instance.
MotifFinder<TSeqType, TAlgorithm> finder;
MotifFinder has two template parameters represented by TSeqType and TAlgorithm. The first template parameter TSeqType determines the type of sequences to be analyzed. Two possible types of sequences are, for example, Dna and protein sequences.
MotifFinder<Dna, TAlgorithm> finder;
MotifFinder<AminoAcid, TAlgorithm> finder;
The second template parameter TAlgorithm specifies the desired motif search algorithm for finding motifs in a set of sequences. Four different motif finding algorithms are available: Projection, ePatternBranching, PMS1 and PMSP. Depending on the chosen parameter TAlgorithm there are different MotifFinder constructors with different arguments.
1.1 Projection Motif Finder
The Projection Motif Finder denotes the partial specialized MotifFinder with Projection as second template parameter.
The PROJECTION algorithm of Buhler and Tompa is a randomized algorithm which does not guarantee that the unknown motif will be found every time. The chance of success can be increased by performing a large number of independent trials to generate multiple guesses. In each trial, the implemented PROJECTION algorithm makes a preselection of sets of length-l patterns called l-mers which are likely to be a collection of motif instances and refines them by using the EM algorithm of Bailey and Elkan. In addition to the basic parameters t (number of query sequences), l (length of motif), m_total (total number of possible l-mers (m_total = t*(n-l+1), if all sequences have the same sequence length)), d (number of substitutions) and the boolean parameter is_exact (size of Hamming distance), PROJECTION has the three key parameters k (projection size), s (bucket threshold) and tr (number of independent trials).
MotifFinder<TSeqType, Projection> finder;
MotifFinder<TSeqType, Projection> finder(t,l,m_total,d,is_exact);
MotifFinder<TSeqType, Projection> finder(t,l,m_total,d,is_exact,k,s,tr);
The Projection Motif Finder provides different constructors as shown above. When creating a Projection Motif Finder object by calling the constructor without the key parameters k, s and tr as constructor arguments the key parameters are internally computed.
1.2 ePatternBranching Motif Finder
The ePatternBranching Motif Finder denotes the partial specialized MotifFinder with EPatternBranching as second template parameter.
The ePatternBranching algorithm of Davila and Rajasekaran is an extended version of the well-known heuristic PatternBranching algorithm. This pattern-based algorithm searches in the space of possible motifs. Starting from each l-mer x in the input sequences, the algorithm iteratively searches around the vicinities of x and finds the best neighbors by applying the function bestNeighbors which selects at the end of each step those patterns from the set of best neighbors that fulfill a particular condition and that are therefore qualified for being a motif instance. In its efficient version, ePatternBranching first generates all patterns in the Hamming distance h-neighborhood of x. h is an integer value passed as an input argument to the algorithm and represents the size of the neighborhood which is processed at first.
MotifFinder<TSeqType, EPatternBranching> finder;
MotifFinder<TSeqType, EPatternBranching> finder(t,l,d,is_exact,h);
MotifFinder<TSeqType, EPatternBranching> finder(t,l,d,is_exact,h,n_ar);
As shown above, the ePatternBranching Motif Finder provides three different constructors with different arguments. The parameter h can be passed either directly as constructor argument or it is internally computed. In the latter case, it is necessary to pass an integer array (n_ar) for the computation of h containing values which represent the respective sequence lengths of each input sequence.
1.3 PMS1 Motif Finder
The PMS1 Motif Finder denotes the partial specialized MotifFinder with PMS1 as second template parameter.
The PMS1 algorithm developed by Rajasekaran et al. searches in the space of possible motifs such as the ePatternBranching algorithm. The procedure of the PMS1 algorithm is quite simple. For every l-mer x in each input sequence the algorithm generates all possible length-l patterns in the Hamming distance d-neighborhood of x. The neighbor sets for each sequence are then intersected so that at the end of the process we get a group of l-mers or a single l-mer that occurs in each input sequence with exactly d substitutions.
MotifFinder<TSeqType, PMS1> finder;
MotifFinder<TSeqType, PMS1> finder(l,d,is_exact);
The non-default constructor has the three constructor arguments l (length of motif), d (number of substitutions) and the boolean parameter is_exact (size of Hamming distance).
1.4 PMSP Motif Finder
The PMSP Motif Finder denotes the partial specialized MotifFinder with PMSP as second template parameter.
The PMSP algorithm of Davila et al. is an improvement of the PMS1 algorithm. It examines each possible l-mer of the first input sequence, explores its neighborhood and finally checks whether an l-mer in the neighborhood is a motif instance.
MotifFinder<TSeqType, PMSP> finder;
MotifFinder<TSeqType, PMSP> finder(l,d,is_exact);
The non-default constructor of the PMSP Motif Finder has the same three arguments l, d and is_exact as the PMS1 Motif Finder.
2 Function template findMotif
findMotif(finder, dataset, sequence_model);
finderA MotifFinder object. (Types: Projection Motif Finder, ePatternBranching Motif Finder, PMS1 Motif finder, PMSP Motif Finder)
datasetA group of DNA or protein sequences (the training set). (Types: StringSet<String<Dna> > (DnaString), StringSet<String<AminoAcid> > (Peptide))
sequence_modelA model type for sequence data. (The type of motif distribution to assume.) (Types:OOPS, OMOPS, ZOOPS, TCM)
The function findMotif starts the search for noticeable motif patterns within the sequences in the dataset. It has three input parameters, finder, dataset and sequence_model as shown above, where the object instance finder of type MotifFinder specifies the algorithm which will be used to solve the motif search problem.
2.1 Query sequences
The parameter dataset represents the set of sequences which is analyzed for motif patterns that are shared among the sequences. The sequences are, for example, DNA or protein sequences. SeqAn provides different types for the representation of nucleotides and amino acids.
2.2 Sequence model
Depending on the chosen Motif Finder there are different types for sequence data (sequence_model) which can be specified by the user.
Type of MotifFinder
Possible options for 'sequence_model'
MotifFinder<TSeqType, Projection>OOPS, OMOPS, ZOOPS, TCM
MotifFinder<TSeqType, ePatternBranching>OOPS, OMOPS
MotifFinder<TSeqType, PMS1>OOPS, OMOPS
The Projection Motif Finder is able to run in OOPS, OMOPS, ZOOPS and TCM mode, while the ePatternBranching Motif Finder only supports the two sequence models, OOPS and OMOPS. The PMS1 Motif Finder and the PMSP Motif Finder, on the other hand, are able to run in all the four modes OOPS, OMOPS, ZOOPS and TCM.
3 Performing the search for motif occurrences in SeqAn
The motif finding process in SeqAn consists of four main steps. Before going into the details of each of the steps, let us first consider the following motif finding example.
Given a set of five DNA sequences, can we find the unknown motif M of length 10 which occurs once in each sequence with exactly two substitutions?
Here is the solution of our motif finding example.
The following sections describe each of the four main steps in more detail and show how to use the Seqan's Motif Finder in order to find the unknown motif M =GGTGTATAAA in the above example.
Step 1: Defining the sample dataset
Before starting a motif search a sample set of sequences (e.g. DNA or protein sequences) must be defined which is believed to have highly conserved sequence motifs at unknown positions in the sample sequences. In the context of DNA sequences, for example, the upstream sequences of co-regulated genes which are controlled by the same regulatory mechanism can form a possible sample set.
The user of SeqAn can specify the input dataset for which the task is to be performed in a number of ways. For short sequences, the easiest way is the explicit definition of the sample sequences which must be all of the same sequence type. For example:
String<String<Dna> > dataset;
Alternatively, the input sequences can be read from a file.
Step 2: Choosing the appropriate Motif Finder
As mentioned before, SeqAn provides the four Motif Finder types: Projection Motif Finder, ePatternBranching Motif Finder, PMS1 Motif Finder and PMSP Motif Finder. Each of them represents one specific motif finding algorithm of the same name. If the user is interested in examining all possible motifs which occur in the sample sequences, it is advisable to choose the PMS1 or the PMSP Motif Finder, because the other two Motif Finder types, Projection Motif Finder and ePatternBranching Motif Finder, are both approximation algorithms.
Type of Motif Finder
Information concerning the algorithm performance
ProjectionThe Projection algorithm is able to handle both short and longer motifs within a reasonable time.
ePatternBranchingThe ePatternBranching algorithm works with good experimental success rates and does not show any considerable loss of accuracy when comparing its performance results with those of the exact algorithms PMSP and PMS1. However, the algorithm has a prohibitive run time, especially for the challenging problems and for those motif finding problems having a set of parameters l and d which is close to challenging instances.
PMS1The PMS1 algorithm has a run time similar to PMSP and a lower run time than ePatternBranching. It works well for values of d which are lower/equal to 3. If d is greater than 3, the memory requirements of PMS1 will increase drastically so that it will not work any longer for the search for ZOOPS and TCM motifs.
PMSPThe PMSP algorithm is more space efficient than PMS1 and can solve the motif problems by making use of significantly less memory. But its running time significantly increases with the size of the motif's length l so that bigger challenging problems can not reported due to the long processing times.
MotifFinder<Dna, PMSP> motif_finder(10,2,true);
Step 3: Performing the motif search using function findMotif
In order to begin the motif search, the function template findMotif must be applied to the input dataset. Since instances of the variant motif are planted exactly once in each sample sequence, we set the third input parameter sequence_model to OOPS. For example:
Step 4: Handling results obtained from the motif finding algorithm
SeqAn offers the function displayResult which is used to display all motif candidates found by the appropriate algorithm. The function only has one parameter which must be of type MotifFinder.
For more examples see this demo program.
SeqAn - Sequence Analysis Library -