In this paper we study the average behavior of the number of distinct substrings in a text of size $n$ over an alphabet of cardinality $k$. This quantity is called the complexity index, and it captures the "richness of the language" used in a sequence. For example, sequences with a low complexity index contain a large number of repeated substrings and eventually become periodic (e.g., tandem repeats in a DNA sequence). To identify strings of unusually low or high complexity, one needs to determine how far the complexities of the strings under study are from the average or maximum string complexity. While the maximum string complexity has been studied quite extensively, to the best of our knowledge there are no results concerning the average complexity. We first prove that for a sequence generated by a mixing model (which includes Markov sources) the average complexity is asymptotically equal to $n^2/2$, which coincides with the maximum string complexity. For a memoryless source, however, we establish a more precise result: the average string complexity is $n^2/2 - n \log_k n + \left(1 + (1 - \gamma)/\ln k + \varphi_k(\log_k n) + o(1)\right) n$, where $\gamma \approx 0.577$ is Euler's constant and $\varphi_k(x)$ is a periodic function whose amplitude is small when the alphabet size is small.
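To make the definition concrete, the following is a minimal sketch, not the paper's method: it counts distinct non-empty substrings by brute force and evaluates the non-oscillating part of the memoryless-source expression quoted above. The function names `complexity_index` and `leading_terms` are hypothetical, and the sketch assumes the periodic term $\varphi_k$ and the $o(1)$ term can be ignored for a rough comparison.

```python
import math


def complexity_index(text: str) -> int:
    """Count the distinct non-empty substrings of `text` by brute force.

    This enumerates all O(n^2) substrings, so it is only meant for short
    strings; a suffix tree or suffix automaton would be used at scale.
    """
    seen = set()
    n = len(text)
    for start in range(n):
        for end in range(start + 1, n + 1):
            seen.add(text[start:end])
    return len(seen)


def leading_terms(n: int, k: int) -> float:
    """Evaluate n^2/2 - n*log_k(n) + (1 + (1 - gamma)/ln k) * n.

    Assumption: the periodic term phi_k(log_k n) and the o(1) term from
    the abstract are dropped, so this is only an approximation.
    """
    gamma = 0.5772156649  # Euler's constant
    return n * n / 2 - n * math.log(n, k) + (1 + (1 - gamma) / math.log(k)) * n


if __name__ == "__main__":
    # A periodic string has few distinct substrings for its length.
    print(complexity_index("aaaa"), complexity_index("abab"))
    print(leading_terms(1000, 4))
```

A highly repetitive string such as `"aaaa"` has only one distinct substring per length, whereas a string with no repeated characters of length $n$ attains $n(n+1)/2$. Comparing `complexity_index` on sampled strings against `leading_terms` is one informal way to see where a given string sits relative to the average.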