Motif Statistics

Authors:
Pierre Nicodème;Bruno Salvy;Philippe Flajolet
Venue:
ESA '99 Proceedings of the 7th Annual European Symposium on Algorithms
Year:
1999

Abstract

We present a complete analysis of the statistics of the number of occurrences of a regular expression pattern in a random text. This covers "motifs" widely used in computational biology. Our approach is based on: (i) classical constructive results in theoretical computer science (automata and formal language theory); (ii) analytic combinatorics to compute asymptotic properties from generating functions; (iii) computer algebra to determine generating functions explicitly, analyse generating functions and extract coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients, which yields exact values of the moments, with typical application to random texts of size 30,000; precise asymptotic formulæ that allow predictions in texts of arbitrarily large sizes. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7-megabyte amino-acid database Prodom. We handled more than 88% of the standard collection of Prosite motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.
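
The automaton-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which relies on computer algebra and generating functions. It is a plain dynamic program that tracks, for each automaton state, the probability of having seen k matches so far, which amounts to propagating the coefficients of a bivariate generating function. It assumes a memoryless (Bernoulli) text model and overlapping matches, and for brevity it handles a single word rather than a general regular expression. All function names (`word_dfa`, `occurrence_distribution`, `mean_and_variance`) are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def word_dfa(word, alphabet):
    """Build a DFA whose state is the length of the longest prefix of
    `word` that is a suffix of the text read so far (a KMP-style automaton).
    State len(word) means "a match ends at this position"."""
    m = len(word)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for letter in alphabet:
            candidate = word[:state] + letter
            k = min(m, len(candidate))
            # Shrink k until the last k letters of candidate are a prefix of word.
            while k > 0 and candidate[-k:] != word[:k]:
                k -= 1
            delta[state][letter] = k
    return delta, 0, {m}


def occurrence_distribution(delta, start, accepting, probs, n):
    """Exact distribution of the number of positions at which a match ends,
    in a random text of length n with independent letters drawn from `probs`."""
    # weights[(state, count)] = probability of being in `state` with `count` matches
    weights = {(start, 0): Fraction(1)}
    for _ in range(n):
        nxt = defaultdict(Fraction)
        for (state, count), p in weights.items():
            for letter, q in probs.items():
                target = delta[state][letter]
                nxt[(target, count + (target in accepting))] += p * q
        weights = nxt
    dist = defaultdict(Fraction)
    for (_, count), p in weights.items():
        dist[count] += p
    return dict(dist)


def mean_and_variance(dist):
    """First two moments of a distribution given as {value: probability}."""
    mean = sum(k * p for k, p in dist.items())
    second = sum(k * k * p for k, p in dist.items())
    return mean, second - mean * mean


if __name__ == "__main__":
    probs = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
    delta, start, accepting = word_dfa("aa", "ab")
    dist = occurrence_distribution(delta, start, accepting, probs, 3)
    print(dist, mean_and_variance(dist))
```

Using `Fraction` keeps the moments exact, in the spirit of the exact values mentioned in the abstract. The table of (state, count) weights grows with the text length, so this sketch suits short texts only; it does not reproduce the fast coefficient extraction or the asymptotic formulæ that the abstract describes.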