We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are $k$ background sequences, and each character in a background sequence is a random character from an alphabet $\Sigma$. A motif $G = g_1 g_2 \ldots g_m$ is a string of $m$ characters. A probabilistically generated approximate copy of $G$ is implanted in each background sequence. In an approximate copy $b_1 b_2 \ldots b_m$ of $G$, every character is generated probabilistically such that the probability that $b_i \neq g_i$ is at most $\alpha$. It has been conjectured that multiple background sequences can help with finding faint motifs $G$. In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet $\Sigma$ with $|\Sigma| \ge 2$ and is applicable to DNA motif discovery. We prove that for $\alpha$ [...] and any constant $x \ge 8$, there exist positive constants $c_0$, $\epsilon$, $\delta_1$, and $\delta_2$ such that if the length $\rho$ of the motif $G$ is at least $\delta_1 \log n$ and there are $k \ge c_0 \log n$ input sequences, then in $O(n^2 + kn)$ time this algorithm finds the motif with probability at least $1-{1\over 2^x}$ for every $G\in \Sigma^{\rho}-\Psi_{\rho, h,\epsilon}(\Sigma)$, where $h$ is a parameter with $\rho \ge 4h \ge \delta_2 \log n$, and $\Psi_{\rho, h,\epsilon}(\Sigma)$ is a small subset containing at most a $2^{-\Theta(\epsilon^2 h)}$ fraction of the sequences in $\Sigma^{\rho}$. The constants $c_0$, $\epsilon$, $\delta_1$, and $\delta_2$ do not depend on $x$ when $x$ is a parameter of order $O(\log n)$. Our algorithm can take any number $k$ of sequences as input.
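The probabilistic model described above can be illustrated with a small generator of test instances. This is only a sketch under stated assumptions, not the discovery algorithm of the paper: the names `implant_copy`, `generate_instance`, and `hamming` are hypothetical, `n` is taken to be the length of every background sequence, the implant position is chosen uniformly, the copy overwrites background characters, and each motif character is replaced by a different character with probability exactly `alpha` (the model only requires this probability to be at most $\alpha$).

```python
import random


def implant_copy(motif, alphabet, alpha, rng):
    """Return an approximate copy of `motif`.

    Assumption: each position mutates independently with probability
    `alpha`, and a mutated position receives a uniformly chosen
    character different from the original one.
    """
    copy = []
    for g in motif:
        if rng.random() < alpha:
            # Mutate: pick any character of the alphabet other than g.
            copy.append(rng.choice([c for c in alphabet if c != g]))
        else:
            copy.append(g)
    return "".join(copy)


def generate_instance(motif, k, n, alphabet="ACGT", alpha=0.1, seed=0):
    """Generate k random background sequences of length n, each with one
    approximate copy of `motif` implanted at a random position.

    Returns (sequences, positions), where positions[i] is the start
    index of the implanted copy in sequences[i].
    """
    if len(alphabet) < 2:
        raise ValueError("alphabet must contain at least two characters")
    if len(motif) > n:
        raise ValueError("motif must not be longer than a sequence")
    rng = random.Random(seed)
    sequences, positions = [], []
    for _ in range(k):
        # Background: every character is uniform over the alphabet.
        background = [rng.choice(alphabet) for _ in range(n)]
        # Assumption: the implant position is uniform over valid starts.
        start = rng.randrange(n - len(motif) + 1)
        copy = implant_copy(motif, alphabet, alpha, rng)
        # Assumption: the copy overwrites the background at `start`.
        background[start:start + len(motif)] = copy
        sequences.append("".join(background))
        positions.append(start)
    return sequences, positions


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))
```

Calling `generate_instance("ACGTACGT", k=5, n=50, alpha=0.1)` yields five sequences together with the implant positions, so the output of a motif finder can be compared against the planted motif, for example with `hamming`.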