Probabilistic Analysis of a Motif Discovery Algorithm for Multiple Sequences

Authors:
Bin Fu;Ming-Yang Kao;Lusheng Wang
Affiliations:
binfu@cs.panam.edu;kao@northwestern.edu;lwang@cs.cityu.edu.hk
Venue:
SIAM Journal on Discrete Mathematics
Year:
2009

Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the quality of motif discovery programs. In this model, there are $k$ background sequences, and each character in a background sequence is a random character from an alphabet $\Sigma$. A motif $G=g_1g_2\cdots g_m$ is a string of $m$ characters. A probabilistically generated approximate copy of $G$ is implanted into each background sequence. For an approximate copy $b_1b_2\cdots b_m$ of $G$, every character $b_i$ is probabilistically generated such that the probability that $b_i\neq g_i$ is at most $\alpha$. In this paper, we give the first analytical proof that multiple background sequences do help with finding subtle and faint motifs. This work is a theoretical approach with a rigorous probabilistic analysis. We develop an algorithm that, under the probabilistic model, finds the implanted motif with high probability when the number of background sequences is reasonably large. Specifically, we prove that for $\alpha<\ldots$ there exist constants $t_0,\delta_0,\delta_1>0$ such that if the length of the motif is at least $\delta_0\log n$, the alphabet has at least $t_0$ characters, and there are at least $\delta_1\log n_0$ input sequences, then in $O(n^3)$ time our algorithm finds the motif with probability at least $1-\frac{1}{2^x}$, where $n$ is the length of the longest input sequence and $n_0\leq n$ is an upper bound on the length of the motif.
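
The generative model described in the abstract can be sketched in Python. This is a minimal illustration, not the authors' code. The function names (`mutate_motif`, `generate_instance`) are hypothetical, and several details are assumptions made only for the example: background characters are drawn uniformly from the alphabet, the implant position is uniformly random, the approximate copy overwrites the background at that position, all sequences share one length $n$, and each motif character is changed with probability exactly $\alpha$ (the model only requires at most $\alpha$).

```python
import random


def mutate_motif(motif, alphabet, alpha, rng):
    """Return an approximate copy of `motif`.

    Assumption: each position is changed independently with probability
    alpha, and a changed position takes a uniformly random different character.
    """
    copy = []
    for g in motif:
        if rng.random() < alpha:
            copy.append(rng.choice([c for c in alphabet if c != g]))
        else:
            copy.append(g)
    return "".join(copy)


def generate_instance(motif, alphabet, k, n, alpha, seed=0):
    """Generate k sequences of length n, each with one implanted approximate copy.

    Returns the sequences and the (normally hidden) implant positions.
    """
    rng = random.Random(seed)
    m = len(motif)
    sequences, positions = [], []
    for _ in range(k):
        # Assumption: uniform random background over the alphabet.
        background = [rng.choice(alphabet) for _ in range(n)]
        # Assumption: the copy overwrites m background characters at a
        # uniformly random start position.
        start = rng.randrange(n - m + 1)
        background[start:start + m] = mutate_motif(motif, alphabet, alpha, rng)
        sequences.append("".join(background))
        positions.append(start)
    return sequences, positions


if __name__ == "__main__":
    seqs, pos = generate_instance("ACGTTGCA", "ACGT", k=4, n=40, alpha=0.1, seed=1)
    for s, p in zip(seqs, pos):
        print(p, s)
```

A motif discovery program would receive only the sequences; the returned positions are kept here so that a recovered motif can be checked against the planted one.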