Motif statistics

Authors:
Pierre Nicodème;Bruno Salvy;Philippe Flajolet
Affiliations:
Laboratoire de Statistique et Génomes, CNRS, La Génopole, 523 Place des Terrasses, 91000 Evry, France;Algorithms Project, Inria Rocquencourt, B.P. 105-Domaine de Voluceau-Rocquencourt, 78153 Le Chesnay Cedex, France;Algorithms Project, Inria Rocquencourt, B.P. 105-Domaine de Voluceau-Rocquencourt, 78153 Le Chesnay Cedex, France
Venue:
Theoretical Computer Science
Year:
2002

Abstract

We present a complete analysis of the statistics of the number of occurrences of a regular expression pattern in a random text. This covers the "motifs" widely used in computational biology. Our approach is based on: (i) classical constructive results in automata and formal language theory; (ii) analytic combinatorics, which is used to derive asymptotic properties from generating functions; (iii) computer algebra, which is used to determine generating functions explicitly, analyse them, and extract their coefficients efficiently. We provide constructions for both overlapping and non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients, which yields exact values of the moments, with typical application to random texts of size 30,000; and precise asymptotic formulæ that allow predictions in texts of arbitrarily large size. We tested our implementation by comparing predicted numbers of motif occurrences against those observed in PRODOM, a 7-megabyte amino acid database. Our programs handled more than 88% of the standard collection of PROSITE motifs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.
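The automaton-based counting idea in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: it swaps their symbolic generating functions for a plain dynamic program, handles only a finite set of words rather than a general regular expression, and assumes a memoryless letter model. The function names (build_dfa, occurrence_distribution, mean_and_variance) are hypothetical. The program tracks, for each automaton state, the probability of having seen k text positions at which a match ends (overlapping matches allowed).

```python
from fractions import Fraction


def build_dfa(words, alphabet):
    """Build a DFA that recognizes texts ending with one of `words`.

    States are the prefixes of the words. Reading a letter moves to the
    longest suffix of the text read so far that is still such a prefix.
    Assumes every word is non-empty.
    """
    prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
    delta, accepting = {}, set()
    for u in prefixes:
        for a in alphabet:
            v = u + a
            while v not in prefixes:  # terminates: "" is always a prefix
                v = v[1:]
            delta[u, a] = v
        if any(u.endswith(w) for w in words):
            accepting.add(u)
    return delta, accepting


def occurrence_distribution(words, probs, n):
    """Exact law of the number of positions, in a random text of length n,
    at which some word of `words` ends. `probs` maps letters to
    probabilities (independent, identically distributed letters)."""
    delta, accepting = build_dfa(words, probs)
    # layer[state][k] = P(automaton is in `state` and k matches seen so far)
    layer = {"": {0: Fraction(1)}}
    for _ in range(n):
        nxt = {}
        for state, counts in layer.items():
            for a, p in probs.items():
                target = delta[state, a]
                hit = 1 if target in accepting else 0
                row = nxt.setdefault(target, {})
                for k, q in counts.items():
                    row[k + hit] = row.get(k + hit, 0) + p * q
        layer = nxt
    # Sum over the final states to obtain the law of the count alone.
    total = {}
    for counts in layer.values():
        for k, q in counts.items():
            total[k] = total.get(k, 0) + q
    return total


def mean_and_variance(dist):
    """Mean and variance of a distribution given as {value: probability}."""
    mean = sum(k * p for k, p in dist.items())
    second = sum(k * k * p for k, p in dist.items())
    return mean, second - mean * mean


# Usage: occurrences of "aa" in a uniform random text of length 3 over {a, b}.
half = Fraction(1, 2)
dist = occurrence_distribution(["aa"], {"a": half, "b": half}, 3)
print(mean_and_variance(dist))  # mean 1/2, variance 1/2
```

Exact rational arithmetic keeps the moments exact, in the spirit of the exact values mentioned in the abstract, at the cost of speed. For long texts, floating-point probabilities or a recurrence on the moments alone would be a more practical choice.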