Detecting Motifs in a Large Data Set: Applying Probabilistic Insights to Motif Finding

  • Authors:
  • Christina Boucher; Daniel G. Brown

  • Affiliations:
  • David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1; David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

  • Venue:
  • BICoB '09 Proceedings of the 1st International Conference on Bioinformatics and Computational Biology
  • Year:
  • 2009

Abstract

We give a probabilistic algorithm for Consensus Sequence, an NP-complete subproblem of motif recognition that can be described as follows: given a set of l-length sequences, determine whether there exists a sequence that has Hamming distance at most d from every sequence in the set. We demonstrate that the distance between a randomly selected majority sequence and a consensus sequence decreases as the size of the data set increases. Applying our probabilistic paradigms and insights to motif recognition, we develop pMCL-WMR, a program capable of detecting motifs in large synthetic and real genomic data sets. Our results show that detecting motifs becomes easier and more efficient as the number of sequences in the data set increases, a surprising and counter-intuitive fact that has significant impact on this deeply investigated area.
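
The Consensus Sequence problem and the notion of a majority sequence can be illustrated with a minimal Python sketch. This is not the authors' probabilistic algorithm and not pMCL-WMR; it only checks whether a given candidate lies within Hamming distance d of every input sequence, and builds one majority sequence. It assumes that a majority sequence takes a most frequent symbol at each position, with ties broken uniformly at random. The function names (hamming, is_consensus, random_majority_sequence) are hypothetical and chosen for illustration.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_consensus(candidate, sequences, d):
    """True if candidate is within Hamming distance d of every sequence."""
    return all(hamming(candidate, s) <= d for s in sequences)


def random_majority_sequence(sequences, rng=None):
    """Pick a most frequent symbol per column (assumed definition).

    Ties between equally frequent symbols are broken uniformly at random.
    """
    rng = rng or random.Random()
    result = []
    for column in zip(*sequences):
        counts = Counter(column)
        top = max(counts.values())
        tied = sorted(sym for sym, c in counts.items() if c == top)
        result.append(rng.choice(tied))
    return "".join(result)


# Toy input: three sequences of length l = 4 with no ties in any column.
seqs = ["ACGT", "ACGA", "ACCT"]
majority = random_majority_sequence(seqs)
```

On this toy input the majority sequence is ACGT, which is a consensus sequence for d = 1 but not for d = 0. In general a majority sequence is only a candidate: the abstract's claim is that such a candidate tends to lie closer to a true consensus sequence as the number of input sequences grows.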