On average sequence complexity

  • Authors:
  • Svante Janson; Stefano Lonardi; Wojciech Szpankowski

  • Affiliations:
  • Department of Mathematics, Uppsala University, Uppsala, Sweden; Department of Computer Science and Engineering, University of California, Riverside, CA; Department of Computer Sciences, Purdue University, West Lafayette, IN

  • Venue:
  • Theoretical Computer Science
  • Year:
  • 2004

Abstract

In this paper we study the average behavior of the number of distinct substrings in a text of size $n$ over an alphabet of cardinality $k$. This quantity is called the complexity index, and it captures the "richness of the language" used in a sequence. For example, sequences with a low complexity index contain a large number of repeated substrings and eventually become periodic (e.g., tandem repeats in a DNA sequence). To identify unusually low- or high-complexity strings, one needs to determine how far the complexities of the strings under study are from the average or maximum string complexity. While the maximum string complexity has been studied quite extensively, to the best of our knowledge there are no results concerning the average complexity. We first prove that for a sequence generated by a mixing model (which includes Markov sources) the average complexity is asymptotically equal to $n^2/2$, which coincides with the maximum string complexity. However, for a memoryless source we establish a more precise result, namely that the average string complexity is $n^2/2 - n \log_k n + \left(1 + (1 - \gamma)/\ln k + \phi_k(\log_k n) + o(1)\right) n$, where $\gamma \approx 0.577$ is Euler's constant and $\phi_k(x)$ is a periodic function with a small amplitude for small alphabet size.
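As an illustration of the quantities in the abstract, the following Python sketch counts distinct substrings by brute force and compares a simulated average with the leading terms of the expansion stated for a memoryless source. It is not the authors' method. The function names, the choice of a uniform (symmetric) memoryless source, and the sample sizes are assumptions made for this example, and the periodic term and the o(1) term are simply dropped, so only rough agreement should be expected.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler's constant (the gamma in the abstract)


def complexity_index(text: str) -> int:
    """Count the distinct non-empty substrings of text by brute force."""
    n = len(text)
    seen = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            seen.add(text[i:j])
    return len(seen)


def leading_terms(n: int, k: int) -> float:
    """Leading terms of the average complexity stated in the abstract.

    Simplification (assumption of this sketch): the periodic term
    phi_k(log_k n) and the o(1) term are omitted.
    """
    return n * n / 2 - n * math.log(n, k) + (1 + (1 - EULER_GAMMA) / math.log(k)) * n


def average_complexity(n: int, k: int, trials: int, seed: int = 0) -> float:
    """Estimate the average complexity index of random texts of length n.

    Assumption: symbols are drawn independently and uniformly from k letters,
    i.e. a symmetric memoryless source.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"[:k]
    total = 0
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(n))
        total += complexity_index(text)
    return total / trials


if __name__ == "__main__":
    n, k = 200, 4  # hypothetical sizes, kept small for the quadratic brute force
    print("simulated average:", average_complexity(n, k, trials=20))
    print("leading terms:    ", leading_terms(n, k))
    print("maximum possible: ", n * (n + 1) // 2)
```

The brute-force count stores every substring and is only practical for short texts; a suffix tree or suffix automaton would give the same count in linear time for longer inputs.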