On average sequence complexity

Authors:
Svante Janson;Stefano Lonardi;Wojciech Szpankowski
Affiliations:
Department of Mathematics, Uppsala University, Uppsala, Sweden;Department of Computer Science and Engineering, University of California, Riverside, CA;Department of Computer Sciences, Purdue University, West Lafayette, IN
Venue:
Theoretical Computer Science
Year:
2004

Abstract

In this paper we study the average behavior of the number of distinct substrings in a text of size $n$ over an alphabet of cardinality $k$. This quantity is called the complexity index, and it captures the "richness of the language" used in a sequence. For example, sequences with a low complexity index contain a large number of repeated substrings and eventually become periodic (e.g., tandem repeats in a DNA sequence). To identify unusually low- or high-complexity strings, one needs to determine how far the complexities of the strings under study are from the average or maximum string complexity. While the maximum string complexity has been studied quite extensively, to the best of our knowledge there are no results concerning the average complexity. We first prove that for a sequence generated by a mixing model (which includes Markov sources) the average complexity is asymptotically equal to $n^2/2$, which coincides with the maximum string complexity. For a memoryless source, however, we establish a more precise result: the average string complexity is $n^2/2 - n \log_k n + \left(1 + (1 - \gamma)/\ln k + \phi_k(\log_k n) + o(1)\right) n$, where $\gamma \approx 0.577$ and $\phi_k(x)$ is a periodic function with a small amplitude for small alphabet size.
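As an illustration of the quantity studied in the abstract, the following is a minimal sketch that counts distinct substrings by brute force and estimates the average over random strings from a uniform memoryless source. It is not the authors' method: the function names (`complexity_index`, `leading_terms`, `average_complexity`), the uniform source, and the fixed seed are assumptions made for this example, and the brute-force count is only practical for short strings.

```python
import math
import random


def complexity_index(text: str) -> int:
    """Count the distinct non-empty substrings of text by brute force."""
    n = len(text)
    return len({text[i:j] for i in range(n) for j in range(i + 1, n + 1)})


def leading_terms(n: int, k: int) -> float:
    """Return n^2/2 - n*log_k(n), the first two terms quoted in the abstract.

    The linear term is deliberately left out, so this is only a rough guide.
    """
    return n * n / 2 - n * math.log(n, k)


def average_complexity(n: int, alphabet: str, trials: int, seed: int = 0) -> float:
    """Estimate the mean complexity index of random strings of length n.

    Assumption: symbols are drawn independently and uniformly from alphabet,
    which is one simple instance of a memoryless source.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(n))
        total += complexity_index(text)
    return total / trials


if __name__ == "__main__":
    n, alphabet = 60, "ACGT"
    estimate = average_complexity(n, alphabet, trials=50)
    print("empirical average:", estimate)
    print("n^2/2 - n log_k n:", leading_terms(n, len(alphabet)))
```

A periodic string such as `"aaaa"` has only 4 distinct substrings, while `"abc"` reaches the maximum of 6 for length 3, which matches the abstract's remark that low-complexity sequences are dominated by repeats.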