Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability

  • Authors:
  • Bin Fu;Ming-Yang Kao;Lusheng Wang

  • Affiliations:
  • Dept. of Computer Science, University of Texas --- Pan American, USA TX 78539;Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, USA IL 60208;Department of Computer Science, The City University of Hong Kong, Kowloon, Hong Kong

  • Venue:
  • TAMC '09 Proceedings of the 6th Annual Conference on Theory and Applications of Models of Computation
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet Σ . A motif G = g 1 g 2 ...g m is a string of m characters. Each background sequence is implanted a probabilistically generated approximate copy of G . For a probabilistically generated approximate copy b 1 b 2 ...b m of G , every character is probabilistically generated such that the probability for b i *** g i is at most *** . It has been conjectured that multiple background sequences can help with finding faint motifs G . In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet Σ with |Σ | *** 2 and is applicable to DNA motif discovery. We prove that for $\alpha and any constant x *** 8, there exist positive constants c 0 , *** , *** 1 and *** 2 such that if the length ρ of motif G is at least *** 1 logn , and there are k *** c 0 logn input sequences, then in O (n 2 + kn ) time this algorithm finds the motif with probability at least $1-{1\over 2^x}$ for every $G\in \Sigma^{\rho}-\Psi_{\rho, h,\epsilon}(\Sigma)$, where ρ is the length of the motif, h is a parameter with ρ *** 4h *** *** 2 logn , and *** ρ , h ,*** (Σ ) is a small subset of at most $2^{-\Theta(\epsilon^2 h)}$ fraction of the sequences in Σ ρ . The constants c 0 , *** , *** 1 and *** 2 do not depend on x when x is a parameter of order O (logn ). Our algorithm can take any number k sequences as input.