Discovering almost any hidden motif from multiple sequences

  • Authors:
  • Bin Fu;Ming-Yang Kao;Lusheng Wang

  • Affiliations:
  • University of Texas--Pan American, Edinburg, TX;Northwestern University, Evanston, IL;The City University of Hong Kong, Hong Kong

  • Venue:
  • ACM Transactions on Algorithms (TALG)
  • Year:
  • 2011

Abstract

We study a natural probabilistic model for motif discovery. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet Σ. A motif G = g_1 g_2 … g_m is a string of m characters. Each background sequence is implanted with a probabilistically generated approximate copy of G. In such a copy b_1 b_2 … b_m of G, every character is generated such that the probability that b_i ≠ g_i is at most α. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet Σ with |Σ| ≥ 2 and is applicable to DNA motif discovery. We prove that for α c_0, ε, and δ_2 such that if there are at least c_0 log n input sequences, then in O((n^2/h)(log n)^{O(1)}) time this algorithm finds the motif with probability at least 3/4 for every G ∈ Σ^ρ − Ψ_{ρ,h,ε}(Σ), where n is the length of the longest sequence, ρ is the length of the motif, h is a parameter with ρ ≥ 4h ≥ δ_2 log n, and Ψ_{ρ,h,ε}(Σ) is a small subset containing at most a 2^{−Θ(ε^2 h)} fraction of the sequences in Σ^ρ.
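The input model described in the abstract can be illustrated with a small generator. This is only a sketch of the planted-motif model, not the authors' discovery algorithm, and it makes several assumptions the abstract does not state: background characters are drawn uniformly, each motif character is mutated independently with probability exactly alpha (the abstract only bounds this probability by α), a mutated character is chosen uniformly among the other characters, and the planted position is uniform. The function name `plant_motif` and its parameters are hypothetical.

```python
import random


def plant_motif(motif, k, n, alpha, alphabet="ACGT", seed=None):
    """Generate k sequences of length n, each with a noisy copy of `motif`.

    Assumptions (not from the paper): uniform background, independent
    per-character mutation with probability exactly `alpha`, uniform
    planting position. Returns (sequences, start_positions).
    """
    if len(alphabet) < 2:
        raise ValueError("alphabet must contain at least 2 characters")
    m = len(motif)
    if m > n:
        raise ValueError("motif must not be longer than the sequences")

    rng = random.Random(seed)
    sequences, positions = [], []
    for _ in range(k):
        # Background sequence: every character is random over the alphabet.
        background = [rng.choice(alphabet) for _ in range(n)]

        # Approximate copy b_1 ... b_m: b_i differs from g_i with prob. alpha.
        copy = []
        for g in motif:
            if rng.random() < alpha:
                copy.append(rng.choice([c for c in alphabet if c != g]))
            else:
                copy.append(g)

        # Implant the copy at a random position (overwrites the background).
        start = rng.randrange(n - m + 1)
        background[start:start + m] = copy

        sequences.append("".join(background))
        positions.append(start)
    return sequences, positions


if __name__ == "__main__":
    seqs, starts = plant_motif("ACGTTGCA", k=4, n=40, alpha=0.1, seed=0)
    for s, p in zip(seqs, starts):
        print(p, s)
```

A motif-discovery algorithm receives only the sequences; the returned start positions are kept here so that a recovered motif can be checked against the planted copies.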