Speeding up exact motif discovery by bounding the expected clump size

Authors:
Tobias Marschall;Sven Rahmann
Affiliations:
Bioinformatics for High-Throughput Technologies, Algorithm Engineering, Computer Science Department, TU Dortmund, Dortmund, Germany (both authors)
Venue:
WABI'10 Proceedings of the 10th international conference on Algorithms in bioinformatics
Year:
2010

Abstract

The overlapping structure of complex patterns, such as IUPAC motifs, significantly affects their statistical properties and should be taken into account in motif discovery algorithms. The contribution of this paper is twofold. On the one hand, we give surprisingly simple formulas for the expected size and weight of motif clumps (maximal overlapping sets of motif matches in a text). In contrast to previous results, we show that these expected values can be computed without matrix inversions. On the other hand, we show how these results can be algorithmically exploited to improve an exact motif discovery algorithm. First, the algorithm can be efficiently generalized to arbitrary finite-memory text models, whereas it was previously limited to i.i.d. texts. Second, we achieve a speed-up of up to a factor of 135. Our open-source (GPL) implementation is available at http://www.rahmannlab.de/software.
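
To make the notion of a clump concrete, the following Python sketch finds all matches of an IUPAC motif in a text and groups them into clumps, reporting the number of matches per clump. It illustrates the definition only and is not the authors' implementation; it does not reproduce the paper's formulas for expected clump size or weight. The function names `match_positions` and `clump_sizes` are hypothetical, and the sketch assumes that two matches overlap when they share at least one text position.

```python
# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_positions(motif, text):
    """Return the start positions of all (possibly overlapping) matches."""
    allowed = [IUPAC[c] for c in motif]
    m = len(motif)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(text[i + j] in allowed[j] for j in range(m))
    ]


def clump_sizes(motif, text):
    """Group matches into clumps and return the match count of each clump.

    Assumption: two matches overlap if they share at least one text
    position. All matches have the same length, so comparing each match
    with the previous one is enough to find maximal overlapping sets.
    """
    sizes = []
    last = None
    for p in match_positions(motif, text):
        if last is not None and p < last + len(motif):
            sizes[-1] += 1   # overlaps the previous match: same clump
        else:
            sizes.append(1)  # no overlap: a new clump starts here
        last = p
    return sizes
```

For example, `clump_sizes("ANA", "AAAAACGATA")` returns `[3, 1]`: the matches at positions 0, 1 and 2 overlap and form one clump, while the match at position 7 forms a clump of its own. Averaging such sizes over many random texts would give an empirical estimate of the expected clump size, which the paper instead obtains in closed form.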