Motif statistics

  • Authors:
  • Pierre Nicodème;Bruno Salvy;Philippe Flajolet

  • Affiliations:
  • Laboratoire de Statistique, et Génomes CNRS, La Génopole, 523 Place des Terrasses, 91000 Evry, France;Algorithms Project, Inria Rocquencourt, B.P. 105-Domaine de Voluceau-Rocquencourt, 78153 Le Chesnay Cedex, France;Algorithms Project, Inria Rocquencourt, B.P. 105-Domaine de Voluceau-Rocquencourt, 78153 Le Chesnay Cedex, France

  • Venue:
  • Theoretical Computer Science
  • Year:
  • 2002

Quantified Score

Hi-index 5.23

Visualization

Abstract

We present a complete analysis of the statistics of number of occurrences of a regular expression pattern in a random text. This covers "motifs" widely used in computational biology. Our approach is based on: (i) classical constructive results in automata and formal language theory; (ii) analytic combinatorics that is used for deriving asymptotic properties from generating functions; (iii) computer algebra in order to determine generating functions explicitly, analyse generating functions and extract coefficients efficiently. We provide constructions for overlapping or non-overlapping matches of a regular expression. A companion implementation produces: multivariate generating functions for the statistics under study; a fast computation of their Taylor coefficients which yields exact values of the moments with typical application to random texts of size 30,000; precise asymptotic formulæ that allow predictions in texts of arbitrarily large sizes. Our implementation was tested by comparing predictions of the number of occurrences of motifs against the 7 megabytes amino acid database PRODOM. We handled more than 88% of the standard collection of PROSITE motifs with our programs. Such comparisons help detect which motifs are observed in real biological data more or less frequently than theoretically predicted.