Theoretical Computer Science
The overlapping structure of complex patterns, such as IUPAC motifs, significantly affects their statistical properties and should be taken into account in motif discovery algorithms. The contribution of this paper is twofold. On the one hand, we give surprisingly simple formulas for the expected size and weight of motif clumps (maximal overlapping sets of motif matches in a text). In contrast to previous results, we show that these expected values can be computed without matrix inversions. On the other hand, we show how these results can be algorithmically exploited to improve an exact motif discovery algorithm. First, the algorithm can be efficiently generalized to arbitrary finite-memory text models, whereas it was previously limited to i.i.d. texts. Second, we achieve a speed-up of up to a factor of 135. Our open-source (GPL) implementation is available at http://www.rahmannlab.de/software.
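
To make the notion of a clump concrete, the following is a minimal sketch that finds all matches of an IUPAC motif in a text and groups them into maximal sets of overlapping matches. It is not the paper's algorithm and does not use its formulas: it only counts clumps empirically in one given text, whereas the paper derives expected values under a text model. The function names `match_positions` and `clumps` are hypothetical, and "size" is taken here to mean the number of matches in a clump, which is an assumption.

```python
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_positions(motif, text):
    """Return the start indices at which the IUPAC motif matches the text."""
    m = len(motif)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(text[i + j] in IUPAC[motif[j]] for j in range(m))
    ]


def clumps(motif, text):
    """Group match positions into clumps (maximal runs of overlapping matches).

    All matches have the same length, so a match at position q overlaps
    the previous match at position p exactly when q < p + len(motif).
    """
    m = len(motif)
    result = []
    for pos in match_positions(motif, text):
        if result and pos < result[-1][-1] + m:
            result[-1].append(pos)   # overlaps the last match: same clump
        else:
            result.append([pos])     # no overlap: start a new clump
    return result


# Illustrative input (assumed for this sketch, not taken from the paper).
example = clumps("ANA", "ACAGATTACA")
sizes = [len(c) for c in example]    # number of matches per clump
```

For the motif `ANA` and the text `ACAGATTACA`, the matches start at positions 0, 2, and 7. The first two overlap and form one clump of size 2, while the third forms a clump of size 1.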