Theoretical Computer Science
We present a complete analysis of the statistics of the number of occurrences of a regular expression pattern in a random text. This covers the "motifs" widely used in computational biology. Our approach is based on: (i) classical constructive results in automata and formal language theory; (ii) analytic combinatorics, used to derive asymptotic properties from generating functions; (iii) computer algebra, used to determine generating functions explicitly, analyse them, and extract their coefficients efficiently. We provide constructions for both overlapping and non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients, which yields exact values of the moments, with typical application to random texts of size 30,000; and precise asymptotic formulæ that allow predictions in texts of arbitrarily large size. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7-megabyte amino acid database PRODOM. Our programs handled more than 88% of the standard collection of PROSITE motifs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.
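The automaton-based counting described above can be illustrated on the simplest kind of pattern, a single word, which is a special case of a regular expression. The sketch below is not the companion implementation: it swaps the generating-function and computer-algebra machinery for a plain dynamic programme over pairs (automaton state, number of matches so far), and it assumes a memoryless random text with independent letters. The function names `word_automaton`, `occurrence_distribution`, and `mean_and_variance` are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction


def word_automaton(word, alphabet):
    """Build a deterministic automaton that tracks overlapping matches of `word`.

    State k means: the longest suffix of the text read so far that is also
    a prefix of `word` has length k. State len(word) marks the end of a match.
    """
    m = len(word)
    delta = []
    for state in range(m + 1):
        row = {}
        for letter in alphabet:
            candidate = word[:state] + letter
            k = min(m, len(candidate))
            # Shrink k until the last k letters of candidate form a prefix of word.
            while k > 0 and candidate[-k:] != word[:k]:
                k -= 1
            row[letter] = k
        delta.append(row)
    return delta, {m}


def occurrence_distribution(delta, accepting, probs, n):
    """Exact distribution of the number of matches in a random text of length n.

    `probs` maps each letter to its probability (assumption: independent letters).
    Returns a dict {count: probability} with exact rational probabilities.
    """
    dist = {(0, 0): Fraction(1)}  # (state, matches so far) -> probability
    for _ in range(n):
        nxt = {}
        for (state, count), p in dist.items():
            for letter, q in probs.items():
                target = delta[state][letter]
                # Entering an accepting state means one more match has ended here.
                key = (target, count + (1 if target in accepting else 0))
                nxt[key] = nxt.get(key, 0) + p * q
        dist = nxt
    totals = {}
    for (_, count), p in dist.items():
        totals[count] = totals.get(count, 0) + p
    return totals


def mean_and_variance(totals):
    """First two moments of a distribution given as {count: probability}."""
    mean = sum(k * p for k, p in totals.items())
    second = sum(k * k * p for k, p in totals.items())
    return mean, second - mean * mean


if __name__ == "__main__":
    # Illustrative input only: the word "aba" over a uniform two-letter alphabet.
    uniform = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
    delta, accepting = word_automaton("aba", "ab")
    totals = occurrence_distribution(delta, accepting, uniform, 10)
    print(mean_and_variance(totals))
```

For a single word of length m over a uniform alphabet of size s, linearity of expectation gives a mean of (n - m + 1) / s^m overlapping occurrences, which provides a quick sanity check on the output. The table of (state, count) pairs grows with the text length, so this direct approach is only meant to show the principle on short texts.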