Hidden word statistics

  • Authors: Philippe Flajolet; Wojciech Szpankowski; Brigitte Vallée

  • Affiliations: INRIA-Rocquencourt, Le Chesnay, France; Purdue University, West Lafayette, Indiana; Université de Caen, Caen Cedex, France, and INRIA-Rocquencourt, Le Chesnay, France

  • Venue: Journal of the ACM (JACM)
  • Year: 2006

Abstract

We consider the sequence comparison problem, also known as the “hidden” pattern problem, where one searches for a given subsequence in a text (rather than a string, understood as a sequence of consecutive symbols). A characteristic parameter is the number of occurrences of a given pattern w of length m as a subsequence in a random text of length n generated by a memoryless source. To define valid occurrences, the spacings between letters of the pattern may be either constrained or unconstrained. We determine the mean and the variance of the number of occurrences, and establish a Gaussian limit law and large deviations. These results are obtained via combinatorics on words, formal language techniques, and methods of analytic combinatorics based on generating functions. The motivation for studying this problem comes from attempts to find a reliable threshold for intrusion detection, from textual data processing applications, and from molecular biology.
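
To make the counted quantity concrete, the following is a minimal sketch in Python that covers only the unconstrained case (no limits on the spacings between pattern letters). It is not the generating-function method of the paper: the count uses a standard dynamic program, and the mean follows from linearity of expectation over the C(n, m) possible position sets under a memoryless source. The function names and the `probs` dictionary are hypothetical, introduced here for illustration.

```python
from math import comb, prod


def count_hidden_occurrences(text, pattern):
    """Count occurrences of `pattern` as a subsequence of `text` (unconstrained spacings)."""
    m = len(pattern)
    # ways[j] = number of ways to embed pattern[:j] into the part of the text read so far
    ways = [1] + [0] * m
    for symbol in text:
        # Scan j from right to left so one text symbol is never used twice in an embedding.
        for j in range(m, 0, -1):
            if pattern[j - 1] == symbol:
                ways[j] += ways[j - 1]
    return ways[m]


def expected_hidden_occurrences(n, pattern, probs):
    """Mean number of unconstrained occurrences in a random text of length n.

    Assumes a memoryless source: symbols are independent and symbol c has
    probability probs[c]. Each of the C(n, m) position sets matches the
    pattern with probability equal to the product of its letter probabilities.
    """
    return comb(n, len(pattern)) * prod(probs[c] for c in pattern)


if __name__ == "__main__":
    print(count_hidden_occurrences("abab", "ab"))
    print(expected_hidden_occurrences(4, "ab", {"a": 0.5, "b": 0.5}))
```

The dynamic program runs in O(nm) time and O(m) space. Handling constrained spacings would require tracking the distance since the last matched letter, which this sketch does not attempt.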