We present a complete analysis of the statistics of the number of occurrences of a regular expression pattern in a random text. This covers the "motifs" widely used in computational biology. Our approach is based on: (i) classical constructive results in theoretical computer science (automata and formal language theory); (ii) analytic combinatorics, to compute asymptotic properties from generating functions; (iii) computer algebra, to determine generating functions explicitly, analyse them, and extract coefficients efficiently. We provide constructions for both overlapping and non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients, which yields exact values of the moments, with typical application to random texts of size 30,000; and precise asymptotic formulæ that allow predictions in texts of arbitrarily large size. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7-megabyte amino-acid database Prodom. Our programs handled more than 88% of the standard collection of Prosite motifs. Such comparisons help detect which motifs occur in real biological data more or less frequently than theoretically predicted.
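To make the automaton-based part of this approach concrete, the following is a minimal sketch, not the companion implementation described above. It assumes a hand-built deterministic automaton for the toy pattern `ab` over the alphabet {a, b} (recognising every text that ends with a match) and a memoryless text model with uniform letter probabilities. The names `DELTA`, `ACCEPTING`, `PROBS`, `occurrence_distribution`, and `mean` are hypothetical, introduced only for illustration. Tracking the pair (state, number of matches so far) step by step gives the exact distribution of the count, which corresponds to reading off the coefficients of a bivariate generating function without forming it symbolically.

```python
from fractions import Fraction

# Hypothetical hand-built DFA for "any text ending in ab" over {a, b}.
# State 0: no progress, state 1: last letter was 'a', state 2: just read "ab".
DELTA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}
ACCEPTING = {2}
START = 0

# Assumed text model: independent letters with uniform probabilities.
PROBS = {"a": Fraction(1, 2), "b": Fraction(1, 2)}


def occurrence_distribution(delta, accepting, start, probs, n):
    """Return {k: P(exactly k match end positions)} for a random text of length n."""
    # dist maps (automaton state, matches counted so far) to a probability.
    dist = {(start, 0): Fraction(1)}
    for _ in range(n):
        nxt = {}
        for (state, count), p in dist.items():
            for letter, q in probs.items():
                target = delta[state][letter]
                # Entering an accepting state marks the end of one match.
                new_count = count + (1 if target in accepting else 0)
                key = (target, new_count)
                nxt[key] = nxt.get(key, Fraction(0)) + p * q
        dist = nxt
    # Sum over automaton states to keep only the count.
    result = {}
    for (_, count), p in dist.items():
        result[count] = result.get(count, Fraction(0)) + p
    return result


def mean(distribution):
    """Exact expected value of a {value: probability} distribution."""
    return sum(k * p for k, p in distribution.items())
```

For example, `mean(occurrence_distribution(DELTA, ACCEPTING, START, PROBS, 10))` evaluates to `Fraction(9, 4)`, in agreement with linearity of expectation: each of the 9 possible end positions carries a match with probability 1/4. The sketch counts every position at which a match ends, so it corresponds to the overlapping convention; for `ab` the two conventions coincide because the word cannot overlap itself. Exact rational arithmetic is used here for clarity rather than speed.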