Exact and Efficient Computation of the Expected Number of Missing and Common Words in Random Texts

Authors:
Sven Rahmann;Eric Rivals
Venue:
CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Year:
2000

Abstract

The number of missing words (NMW) of length q in a text, and the number of common words (NCW) of two texts are useful text statistics. Knowing the distribution of the NMW in a random text is essential for the construction of so-called monkey tests for pseudorandom number generators. Knowledge of the distribution of the NCW of two independent random texts is useful for the average case analysis of a family of fast pattern matching algorithms, namely those which use a technique called q-gram filtration. Despite these important applications, we are not aware of any exact studies of these text statistics. We propose an efficient method to compute their expected values exactly. The difficulty of the computation lies in the strong dependence of successive words, as they overlap by (q-1) characters. Our method is based on the enumeration of all string autocorrelations of length q, i.e., of the ways a word of length q can overlap itself. For this, we present the first efficient algorithm. Furthermore, by assuming the words are independent, we obtain very simple approximation formulas, which are shown to be surprisingly good when compared to the exact values.
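To make two notions from the abstract concrete, the following Python sketch illustrates (a) the autocorrelation of a word, i.e., the set of shifts at which it overlaps itself, and (b) one natural independence approximation for the expected NMW. It is an illustration under stated assumptions, not the authors' algorithm: the function names are hypothetical, the enumeration is brute force over all words (the paper's efficient enumeration is not reproduced here), and the approximation assumes uniform i.i.d. characters over an alphabet of size sigma and treats the n-q+1 word positions as independent, which may differ in detail from the formulas derived in the paper.

```python
from itertools import product


def autocorrelation(word: str) -> str:
    """Return the autocorrelation of `word` as a bit string.

    Bit i is '1' iff the word shifted by i positions agrees with itself
    on the overlapping part, i.e. word[i:] == word[:len(word) - i].
    """
    q = len(word)
    return "".join("1" if word[i:] == word[: q - i] else "0" for i in range(q))


def distinct_autocorrelations(alphabet: str, q: int) -> set:
    """Collect all autocorrelations of length q by brute force.

    Illustration only: this visits all len(alphabet)**q words and is
    therefore exponential in q.
    """
    return {autocorrelation("".join(w)) for w in product(alphabet, repeat=q)}


def approx_expected_missing(sigma: int, q: int, n: int) -> float:
    """Independence approximation of the expected number of missing q-words.

    Assumption (hypothetical simplification): the n - q + 1 words of a
    uniform random text of length n are treated as independent draws from
    the sigma**q possible words, so a fixed word is missing with
    probability (1 - sigma**-q) ** (n - q + 1).
    """
    words = sigma ** q
    positions = n - q + 1
    return words * (1.0 - 1.0 / words) ** positions


if __name__ == "__main__":
    print(autocorrelation("abab"))
    print(sorted(distinct_autocorrelations("ab", 3)))
    print(approx_expected_missing(4, 8, 100_000))
```

The word "abab" overlaps itself at shifts 0 and 2, which the bit string records. The brute-force enumeration shows that many words share the same autocorrelation, which is why grouping words by autocorrelation can reduce the work of an exact computation.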