We consider the sequence comparison problem, also known as the "hidden pattern" problem, in which one searches a text for a given subsequence (rather than for a string, understood as a sequence of consecutive symbols). A characteristic parameter is the number of occurrences of a given pattern w of length m as a subsequence in a random text of length n generated by a memoryless source. To define valid occurrences, the spacings between letters of the pattern may be either constrained or unconstrained. We determine the mean and the variance of the number of occurrences and establish a Gaussian limit law. These results are obtained via combinatorics on words, formal language techniques, and methods of analytic combinatorics based on generating functions and convergence of moments. The problem is motivated by the search for a reliable threshold for intrusion detection, by textual data processing applications, and by molecular biology.
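As an illustration of the parameter studied here, the following minimal sketch counts occurrences of a pattern w as a subsequence of a text in the fully unconstrained case (no limits on spacings). It also evaluates the elementary mean that follows from linearity of expectation for a memoryless source: the binomial coefficient C(n, m) times the product of the probabilities of the letters of w. The function names and the `probs` dictionary are assumptions made for this example and are not notation from the paper. The constrained case, the variance, and the Gaussian limit law are not covered by this sketch.

```python
from itertools import product
from math import comb, prod


def count_occurrences(text, pattern):
    """Count occurrences of `pattern` as a subsequence of `text`
    (unconstrained case: any gaps between matched letters are allowed)."""
    m = len(pattern)
    # ways[j] = number of ways to match the first j pattern letters so far
    ways = [1] + [0] * m
    for ch in text:
        # Go right to left so each text symbol extends a match at most once
        for j in range(m, 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[m]


def expected_occurrences(n, pattern, probs):
    """Mean number of unconstrained occurrences in a random text of length n
    from a memoryless source with letter probabilities `probs`."""
    return comb(n, len(pattern)) * prod(probs[c] for c in pattern)


def brute_force_mean(n, pattern, probs):
    """Exact mean by enumerating every text of length n (small n only)."""
    total = 0.0
    for letters in product(probs, repeat=n):
        weight = prod(probs[c] for c in letters)
        total += weight * count_occurrences(letters, pattern)
    return total


if __name__ == "__main__":
    uniform = {"a": 0.5, "b": 0.5}
    print(count_occurrences("abab", "ab"))
    print(expected_occurrences(4, "ab", uniform))
    print(brute_force_mean(4, "ab", uniform))
```

The brute-force enumeration is included only as a sanity check on the closed-form mean for small n; it grows exponentially with the text length.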