To guarantee that the optimal motif is found, traditional pattern-driven approaches perform an exhaustive search over all candidate motifs of length l. We develop an improved pattern-driven algorithm that takes O(4^l lk) time, where k is the number of sequences in the sample and l is the motif length. For large enough l, this running time is independent of the length n of each sequence, saving a factor of n in time complexity over the original pattern-driven approach. We further extend this strategy to allow arbitrary don't care positions within a motif without much decrease in the solvable values of l. Testing this algorithm on a large set of yeast samples constructed from co-expressed gene clusters reveals that most biological motifs have many invariant or almost invariant positions, and that these positions can be used to define the motif while ignoring the other positions. This motivates the following two-stage strategy, which substantially extends the solvable values of l for the pattern-driven approach: first, use an O(2^l lkn) algorithm to exhaustively search over all candidate motifs, allowing arbitrary don't care positions but disallowing mismatches; then, refine these motifs by allowing a limited amount of flexibility to model the almost invariant positions. We demonstrate that this seemingly restrictive motif definition is sufficiently powerful by showing that the performance of this algorithm is comparable to that of the best existing motif finding algorithms on a large benchmark set of samples. A software program implementing these approaches (MotifEnumerator) is available at http://faculty.cs.tamu.edu/shsze/motifenumerator.
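The first stage of the two-stage strategy (exhaustive search with don't care positions and no mismatches) can be illustrated with a minimal Python sketch. This is not the MotifEnumerator implementation. The function names, the use of '.' as the don't care symbol, the min_fixed parameter, and the ranking by number of supporting sequences are all assumptions made for illustration. Each of the roughly n windows in each of the k sequences is expanded under up to 2^l masks, and building each masked pattern costs O(l), which is where an O(2^l lkn) bound would come from in this naive form.

```python
from collections import defaultdict
from itertools import product


def enumerate_masked_patterns(sequences, l, min_fixed):
    """Map each masked pattern to the set of sequence indices containing it.

    Hypothetical helper: '.' marks a don't care position, and min_fixed is an
    assumed lower bound on the number of fixed positions in a pattern.
    """
    support = defaultdict(set)
    # All 2^l keep/ignore masks, restricted to those with enough fixed positions.
    masks = [m for m in product((True, False), repeat=l) if sum(m) >= min_fixed]
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            window = seq[start:start + l]
            for mask in masks:
                # Keep the fixed positions, blank out the don't care positions.
                pattern = "".join(c if keep else "." for c, keep in zip(window, mask))
                support[pattern].add(idx)
    return support


def best_patterns(sequences, l, min_fixed):
    """Return the patterns that occur exactly (no mismatches) in the most sequences.

    Ranking by sequence count is a placeholder, not the scoring used in the paper.
    """
    support = enumerate_masked_patterns(sequences, l, min_fixed)
    top = max(len(s) for s in support.values())
    return sorted(p for p, s in support.items() if len(s) == top)


if __name__ == "__main__":
    seqs = ["GGACGTGG", "TTACCTAA", "CCACATCC"]
    for pattern in best_patterns(seqs, l=4, min_fixed=3):
        print(pattern)
```

In this toy input, the windows ACGT, ACCT, and ACAT differ only at the third position, so the pattern AC.T is supported by all three sequences, while ACGT itself is supported by only one. The second stage described above, which refines the patterns by tolerating limited variation at almost invariant positions, is not shown.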