Speeding up exact motif discovery by bounding the expected clump size

  • Authors:
  • Tobias Marschall;Sven Rahmann

  • Affiliations:
  • Bioinformatics for High-Throughput Technologies, Algorithm Engineering, Computer Science Department, TU Dortmund, Dortmund, Germany;Bioinformatics for High-Throughput Technologies, Algorithm Engineering, Computer Science Department, TU Dortmund, Dortmund, Germany

  • Venue:
  • WABI'10 Proceedings of the 10th international conference on Algorithms in bioinformatics
  • Year:
  • 2010

Abstract

The overlapping structure of complex patterns, such as IUPAC motifs, significantly affects their statistical properties and should be taken into account in motif discovery algorithms. The contribution of this paper is twofold. On the one hand, we give surprisingly simple formulas for the expected size and weight of motif clumps (maximal overlapping sets of motif matches in a text). In contrast to previous results, we show that these expected values can be computed without matrix inversions. On the other hand, we show how these results can be algorithmically exploited to improve an exact motif discovery algorithm. First, the algorithm can be efficiently generalized to arbitrary finite-memory text models, whereas it was previously limited to i.i.d. texts. Second, we achieve a speed-up of up to a factor of 135. Our open-source (GPL) implementation is available at http://www.rahmannlab.de/software.
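
To make the notion of a clump concrete, here is a minimal Python sketch that finds all matches of an IUPAC motif in a text by brute force and groups them into clumps. It is not the authors' algorithm and does not compute expected values; it only illustrates the definition given in the abstract. The function names `match_positions` and `clumps` are hypothetical, and the sketch assumes that two matches overlap when they share at least one text position and that clump size means the number of matches in a clump.

```python
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_positions(motif, text):
    """Return the start positions of all (possibly overlapping) matches."""
    m = len(motif)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(text[i + j] in IUPAC[motif[j]] for j in range(m))
    ]


def clumps(motif, text):
    """Group match start positions into clumps.

    Assumption: consecutive matches belong to the same clump when they
    share at least one text position, i.e. the next match starts before
    the previous match ends.
    """
    m = len(motif)
    result = []
    for pos in match_positions(motif, text):
        if result and pos < result[-1][-1] + m:
            result[-1].append(pos)   # overlaps the previous match
        else:
            result.append([pos])     # starts a new clump
    return result


if __name__ == "__main__":
    found = clumps("ANA", "ACATAGGAGA")
    print(found)                     # start positions, one list per clump
    print([len(c) for c in found])   # clump sizes
```

In the example call, the matches starting at positions 0 and 2 share the text position 2 and therefore form one clump of size 2, while the match at position 7 forms a clump of its own.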