An upper bound on the hardness of exact matrix based motif discovery

  • Authors:
  • Paul Horton;Wataru Fujibuchi

  • Affiliations:
  • Computational Biology Research Center, National Institute of Advanced Industrial Science, Japan;Computational Biology Research Center, National Institute of Advanced Industrial Science, Japan

  • Venue:
  • CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching
  • Year:
  • 2005

Quantified Score

Hi-index 0.00

Visualization

Abstract

Motif discovery is the problem of finding local patterns or motifs from a set of unlabeled sequences. One common representation of a motif is a Markov model known as a score matrix. Matrix based motif discovery has been extensively studied but no positive results have been known regarding its theoretical hardness. We present the first non-trivial upper bound on the complexity (worst-case computation time) of this problem. Other than linear terms, our bound depends only on the motif width w (which is typically 5-20) and is a dramatic improvement relative to previously known bounds. We prove this bound by relating the motif discovery problem to a search problem over permutations of strings of length w, in which the permutations have a particular property. We give a constructive proof of an upper bound on the number of such permutations. For an alphabet size of σ (typically 4) the trivial bound is $n! \approx ({\frac{n}{e}})^n, n={\sigma}^w$. Our bound is roughly n(σlogσn)n. We relate this theoretical result to the exact motif discovery program, TsukubaBB, whose algorithm contains ideas which inspired the result. We describe a recent improvement to the TsukubaBB program which can give a speed up of nine or more and use a dataset of REB1 transcription factor binding sites to illustrate that exact methods can indeed be used in some practical situations.