Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability

Authors:
Bin Fu; Ming-Yang Kao; Lusheng Wang
Affiliations:
Dept. of Computer Science, University of Texas-Pan American, TX 78539, USA; Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA; Department of Computer Science, The City University of Hong Kong, Kowloon, Hong Kong
Venue:
TAMC '09 Proceedings of the 6th Annual Conference on Theory and Applications of Models of Computation
Year:
2009

Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are $k$ background sequences, and each character in a background sequence is a random character from an alphabet $\Sigma$. A motif $G = g_1 g_2 \ldots g_m$ is a string of $m$ characters. A probabilistically generated approximate copy of $G$ is implanted into each background sequence. In such a copy $b_1 b_2 \ldots b_m$ of $G$, every character is generated probabilistically so that the probability that $b_i \neq g_i$ is at most $\alpha$. It has been conjectured that multiple background sequences can help with finding faint motifs $G$. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet $\Sigma$ with $|\Sigma| \geq 2$ and is applicable to DNA motif discovery. We prove that for $\alpha$ [...] and any constant $x \geq 8$, there exist positive constants $c_0$, $\epsilon$, $\delta_1$ and $\delta_2$ such that if the length $\rho$ of motif $G$ is at least $\delta_1 \log n$, and there are $k \geq c_0 \log n$ input sequences, then in $O(n^2 + kn)$ time this algorithm finds the motif with probability at least $1-{1\over 2^x}$ for every $G\in \Sigma^{\rho}-\Psi_{\rho, h,\epsilon}(\Sigma)$, where $h$ is a parameter with $\rho \geq 4h \geq \delta_2 \log n$, and $\Psi_{\rho, h,\epsilon}(\Sigma)$ is a small subset containing at most a $2^{-\Theta(\epsilon^2 h)}$ fraction of the sequences in $\Sigma^{\rho}$. The constants $c_0$, $\epsilon$, $\delta_1$ and $\delta_2$ do not depend on $x$ when $x$ is a parameter of order $O(\log n)$. Our algorithm can take any number $k$ of sequences as input.
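
The input model described in the abstract can be illustrated with a short generator. The following is a minimal sketch of the data model only, not the authors' discovery algorithm. It rests on several assumptions that the abstract does not state: the function name `implant_motif` is hypothetical, background characters are drawn uniformly, every sequence has the same length `n`, the implant position is uniform, and each motif character is replaced by a different character with probability exactly `alpha` (the abstract only requires "at most").

```python
import random


def implant_motif(motif, k, n, alpha, alphabet="ACGT", seed=0):
    """Generate k background sequences of length n, each with a noisy copy of motif.

    Assumptions (not from the source): uniform background, uniform implant
    position, and independent per-character substitution with probability alpha.
    """
    rng = random.Random(seed)
    m = len(motif)
    sequences, positions = [], []
    for _ in range(k):
        # Background: every character is a random character from the alphabet.
        background = [rng.choice(alphabet) for _ in range(n)]
        # Approximate copy: each b_i differs from g_i with probability alpha.
        copy = [
            rng.choice([c for c in alphabet if c != g]) if rng.random() < alpha else g
            for g in motif
        ]
        # Implant the copy at a random position inside the background.
        start = rng.randrange(n - m + 1)
        background[start:start + m] = copy
        sequences.append("".join(background))
        positions.append(start)
    return sequences, positions


if __name__ == "__main__":
    seqs, starts = implant_motif("ACGTACGTAC", k=5, n=50, alpha=0.1)
    for s, p in zip(seqs, starts):
        print(p, s)
```

A motif finder would receive only the sequences; the returned start positions are kept here so that a candidate answer can be checked against the ground truth.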