Detecting Motifs in a Large Data Set: Applying Probabilistic Insights to Motif Finding

Authors:
Christina Boucher;Daniel G. Brown
Affiliations:
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1;David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
Venue:
BICoB '09 Proceedings of the 1st International Conference on Bioinformatics and Computational Biology
Year:
2009


Abstract

We give a probabilistic algorithm for Consensus Sequence, an NP-complete subproblem of motif recognition that can be described as follows: given a set of sequences of length l, determine whether there exists a sequence that has Hamming distance at most d from every sequence in the set. We demonstrate that the distance between a randomly selected majority sequence and a consensus sequence decreases as the size of the data set increases. Applying our probabilistic paradigms and insights to motif recognition, we develop pMCL-WMR, a program capable of detecting motifs in large synthetic and real genomic data sets. Our results show that detecting motifs becomes easier and more efficient as the number of sequences in the data set increases, a surprising and counter-intuitive fact that has significant impact on this deeply investigated area.
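The definitions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pMCL-WMR program or their probabilistic algorithm; it only shows what a Hamming-distance check and a randomly selected majority sequence look like. The function names (`hamming`, `is_consensus`, `random_majority_sequence`) and the toy input are assumptions made for illustration, and "majority sequence" is assumed here to mean the most frequent symbol in each column, with ties broken uniformly at random.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def is_consensus(candidate, sequences, d):
    """True if candidate is within Hamming distance d of every sequence."""
    return all(hamming(candidate, s) <= d for s in sequences)


def random_majority_sequence(sequences, rng=random):
    """Pick the most frequent symbol in each column (assumed definition).

    Ties between equally frequent symbols are broken uniformly at random,
    which is what makes the selected majority sequence "random".
    """
    length = len(sequences[0])
    result = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        top = max(counts.values())
        tied = sorted(sym for sym, n in counts.items() if n == top)
        result.append(rng.choice(tied))
    return "".join(result)


# Hypothetical toy data: four sequences of length l = 4.
sequences = ["ACGT", "ACGA", "ACCT", "TCGT"]
majority = random_majority_sequence(sequences, random.Random(0))
print(majority, is_consensus(majority, sequences, d=1))
```

In this toy input every column has a unique most frequent symbol, so the majority sequence is determined and lies within distance 1 of each input. The abstract's claim concerns how close such a majority sequence is to a consensus sequence as the number of input sequences grows; this sketch does not attempt to reproduce that analysis.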