Discovering almost any hidden motif from multiple sequences

Authors:
Bin Fu;Ming-Yang Kao;Lusheng Wang
Affiliations:
University of Texas--Pan American, Edinburg, TX;Northwestern University, Evanston, IL;The City University of Hong Kong, Hong Kong
Venue:
ACM Transactions on Algorithms (TALG)
Year:
2011

Abstract

We study a natural probabilistic model for motif discovery. In this model, there are $k$ background sequences, and each character in a background sequence is a random character from an alphabet $\Sigma$. A motif $G = g_1 g_2 \ldots g_m$ is a string of $m$ characters. Each background sequence is implanted with a probabilistically generated approximate copy of $G$. For a probabilistically generated approximate copy $b_1 b_2 \ldots b_m$ of $G$, every character is generated such that the probability that $b_i \neq g_i$ is at most $\alpha$. In this article, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet $\Sigma$ with $|\Sigma| \geq 2$ and that is applicable to DNA motif discovery. We prove that for $\alpha$ $c_0$, $\epsilon$, and $\delta_2$ such that if there are at least $c_0 \log n$ input sequences, then in $O\big((n^2/h)(\log n)^{O(1)}\big)$ time this algorithm finds the motif with probability at least $3/4$ for every $G \in \Sigma^{\rho} - \Psi_{\rho,h,\epsilon}(\Sigma)$, where $n$ is the length of the longest sequence, $\rho$ is the length of the motif, $h$ is a parameter with $\rho \geq 4h \geq \delta_2 \log n$, and $\Psi_{\rho,h,\epsilon}(\Sigma)$ is a small subset consisting of at most a $2^{-\Theta(\epsilon^2 h)}$ fraction of the sequences in $\Sigma^{\rho}$.
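The input model described in the abstract can be illustrated with a short Python sketch. It only generates instances (background sequences with an implanted, mutated copy of the motif); it is not the authors' discovery algorithm. Several details are assumptions made for illustration and are not stated in the abstract: the function name `implant_motif`, uniformly random background characters, a uniformly random implant position, all sequences having the same length, and each motif character being replaced with probability exactly `alpha` (the abstract only requires at most α) by a uniformly chosen different character.

```python
import random


def implant_motif(motif, k, n, alpha, alphabet="ACGT", seed=None):
    """Generate k sequences of length n, each hiding a noisy copy of motif.

    Hypothetical helper for illustration only. Returns the sequences and
    the start position of the implanted copy in each sequence.
    """
    if len(alphabet) < 2:
        raise ValueError("alphabet must contain at least two characters")
    m = len(motif)
    if m > n:
        raise ValueError("motif is longer than the sequences")

    rng = random.Random(seed)
    sequences, positions = [], []
    for _ in range(k):
        # Background: each character drawn uniformly from the alphabet
        # (an assumption; the abstract only says "random character").
        background = [rng.choice(alphabet) for _ in range(n)]

        # Approximate copy: each character differs from the motif with
        # probability alpha (assumed exact here, an upper bound in the paper).
        copy = [
            rng.choice([c for c in alphabet if c != g])
            if rng.random() < alpha
            else g
            for g in motif
        ]

        # Implant the copy at a uniformly random position (assumption).
        start = rng.randrange(n - m + 1)
        background[start:start + m] = copy

        sequences.append("".join(background))
        positions.append(start)
    return sequences, positions
```

For example, `implant_motif("ACGTACGT", k=5, n=50, alpha=0.1, seed=1)` returns five sequences of length 50 together with the implant positions, which can serve as ground truth when experimenting with a motif finder.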