An upper bound on the hardness of exact matrix based motif discovery

Authors:
Paul Horton; Wataru Fujibuchi
Affiliations:
Computational Biology Research Center, National Institute of Advanced Industrial Science, Japan; Computational Biology Research Center, National Institute of Advanced Industrial Science, Japan
Venue:
CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching
Year:
2005

References:
Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning - Special issue on applications in molecular biology
Finding similar regions in many strings. STOC '99 Proceedings of the thirty-first annual ACM symposium on Theory of computing
On approximation algorithms for local multiple alignment. RECOMB '00 Proceedings of the fourth annual international conference on Computational molecular biology
Finding similar regions in many sequences. Journal of Computer and System Sciences - STOC 1999
Tsukuba BB: A Branch and Bound Algorithm for Local Multiple Sequence Alignment. CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching

Abstract

Motif discovery is the problem of finding local patterns, or motifs, in a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix-based motif discovery has been extensively studied, but no positive results have been known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width w (typically 5-20) and is a dramatic improvement relative to previously known bounds. We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length w, in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet of size σ (typically 4), the trivial bound is $n! \approx (\frac{n}{e})^n$, where $n = \sigma^w$. Our bound is roughly $n (\sigma \log_\sigma n)^n$. We relate this theoretical result to the exact motif discovery program TsukubaBB, whose algorithm contains ideas that inspired the result. We describe a recent improvement to the TsukubaBB program that can give a speedup of a factor of nine or more, and we use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations.
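
To give a feel for the gap between the two bounds quoted in the abstract, the sketch below compares their base-10 logarithms, taking the trivial bound as n! and the improved bound as n(σ log_σ n)^n with n = σ^w (so log_σ n = w). It is only an illustration: the function names are hypothetical and not from the paper, and the "roughly" in the abstract means lower-order factors are ignored here.

```python
import math


def log10_trivial_bound(sigma: int, w: int) -> float:
    """log10 of n!, where n = sigma**w (the trivial bound)."""
    n = sigma ** w
    # lgamma(n + 1) = ln(n!), which avoids forming the huge factorial.
    return math.lgamma(n + 1) / math.log(10)


def log10_improved_bound(sigma: int, w: int) -> float:
    """log10 of n * (sigma * log_sigma(n))**n, using log_sigma(n) == w."""
    n = sigma ** w
    return math.log10(n) + n * math.log10(sigma * w)


if __name__ == "__main__":
    sigma = 4  # DNA alphabet size, as in the abstract
    for w in (5, 10, 15, 20):
        trivial = log10_trivial_bound(sigma, w)
        improved = log10_improved_bound(sigma, w)
        print(f"w={w:2d}  log10(trivial)={trivial:.3e}  log10(improved)={improved:.3e}")
```

Working in logarithms keeps the comparison feasible even for w = 20, where n = 4^20. Under these assumptions the improved expression is the smaller one across the typical range of w, although for very small widths (for example w = 1 with σ = 4) the approximate formula exceeds n!.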