Hidden word statistics

Authors:
Philippe Flajolet;Wojciech Szpankowski;Brigitte Vallée
Affiliations:
INRIA-Rocquencourt, Le Chesnay, France;Purdue University, West Lafayette, Indiana;Université de Caen, Caen Cedex, France, and INRIA-Rocquencourt, Le Chesnay, France
Venue:
Journal of the ACM (JACM)
Year:
2006


Abstract

We consider the sequence comparison problem, also known as the “hidden” pattern problem, where one searches for a given subsequence in a text (rather than a string understood as a sequence of consecutive symbols). A characteristic parameter is the number of occurrences of a given pattern w of length m as a subsequence in a random text of length n generated by a memoryless source. Spacings between letters of the pattern may either be constrained or not in order to define valid occurrences. We determine the mean and the variance of the number of occurrences, and establish a Gaussian limit law and large deviations. These results are obtained via combinatorics on words, formal language techniques, and methods of analytic combinatorics based on generating functions. The motivation for studying this problem comes from an attempt to find a reliable threshold for intrusion detection, from textual data processing applications, and from molecular biology.
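
To make the counted quantity concrete, here is a minimal Python sketch for the unconstrained case only (no limits on the spacings between pattern letters). It is an illustration, not the paper's method: the function names `count_hidden_occurrences` and `expected_occurrences` are hypothetical, and the mean used here is the elementary one that follows from linearity of expectation, namely that each of the C(n, m) choices of positions matches the pattern with probability equal to the product of its letter probabilities under a memoryless source.

```python
from math import comb, prod


def count_hidden_occurrences(text, pattern):
    """Count occurrences of `pattern` as a subsequence of `text`
    (unconstrained spacings), using a standard dynamic program."""
    m = len(pattern)
    # ways[j] = number of ways to match the first j pattern letters so far
    ways = [1] + [0] * m
    for ch in text:
        # Go right to left so each text symbol extends a match at most once.
        for j in range(m, 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[m]


def expected_occurrences(n, pattern, probs):
    """Mean number of unconstrained occurrences in a random text of length n
    from a memoryless source with letter probabilities `probs` (assumed dict)."""
    return comb(n, len(pattern)) * prod(probs[c] for c in pattern)


if __name__ == "__main__":
    print(count_hidden_occurrences("abab", "ab"))
    print(expected_occurrences(4, "ab", {"a": 0.5, "b": 0.5}))
```

The constrained case, the variance, and the limit laws described in the abstract need the generating-function machinery of the paper and are not covered by this sketch.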