Asymptotic Behavior of the Height in a Digital Search Tree and the Longest Phrase of the Lempel--Ziv Scheme

  • Authors:
  • Charles Knessl; Wojciech Szpankowski

  • Venue:
  • SIAM Journal on Computing
  • Year:
  • 2000

Abstract

We study the height of a digital search tree (DST) built from $n$ random strings generated by an unbiased memoryless source (i.e., all symbols are equally likely). We argue that the height of such a tree is equivalent to the length of the longest phrase in the Lempel--Ziv parsing scheme that partitions a random sequence into $n$ phrases. We also analyze the longest phrase in the Lempel--Ziv scheme in which a string of fixed length $m$ is parsed into a random number of phrases. In the course of our analysis, we identify four natural regions of the height distribution and characterize them asymptotically for large $n$. In particular, in the region where most of the probability mass is concentrated, the asymptotic distribution of the height exhibits an exponential of a Gaussian distribution (with an oscillating term) around the most probable value $k_1 = \lfloor \log_2 n + \sqrt{2\log_2 n} - \log_2 ( \sqrt{2 \log_2 n} ) + \frac{1}{\log 2} - \frac{1}{2} \rfloor + 1$. More precisely, we prove that the asymptotic distribution of the height of a DST is concentrated either on the one point $k_1$ or on the two points $k_1 - 1$ and $k_1$, which proves a slightly modified version of Kesten's conjecture quoted in [Probab. Theory Related Fields, 79 (1988), pp. 509--542]. Finally, we compare our findings for DSTs with the asymptotic distributions of the height for other digital trees such as tries and PATRICIA tries. We derive these results by a combination of analytic methods, such as generating functions, Laplace transforms, and the saddle point method, and ideas of applied mathematics, such as linearization, asymptotic matching, and the WKB method. Our analysis makes certain assumptions about the forms of some of the asymptotic expansions as well as their asymptotic matching. We also present detailed numerical verification of our results.
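
The objects named in the abstract can be explored numerically with a short simulation. The following Python sketch is illustrative only and is not taken from the paper: the function names `k1`, `dst_height`, and `lz78_phrases` are hypothetical, `\log 2` in the formula is assumed to be the natural logarithm, and the height is counted in edges from the root, a convention that may differ by one from the paper's. Because the source is unbiased and memoryless, the branching bits of each inserted string are drawn lazily as independent fair coin flips instead of storing whole strings.

```python
import math
import random


def k1(n):
    """Most probable height value from the abstract's formula.

    Assumption: '\\log 2' in the formula is the natural logarithm.
    """
    lg = math.log2(n)
    s = math.sqrt(2.0 * lg)
    return math.floor(lg + s - math.log2(s) + 1.0 / math.log(2.0) - 0.5) + 1


def dst_height(n, rng):
    """Insert n random bit strings into a digital search tree; return its height.

    The height is the number of edges on the longest root-to-node path
    (an assumed convention). Only the tree shape is tracked: each node is
    a two-slot list [child_for_bit_0, child_for_bit_1].
    """
    root = None
    height = 0
    for _ in range(n):
        if root is None:
            root = [None, None]  # the first string occupies the root
            continue
        node = root
        depth = 0
        while True:
            bit = rng.getrandbits(1)  # next symbol of the string being inserted
            depth += 1
            if node[bit] is None:
                node[bit] = [None, None]  # first empty slot: store the string here
                break
            node = node[bit]
        height = max(height, depth)
    return height


def lz78_phrases(seq):
    """Parse seq into phrases, each the shortest prefix not seen as an earlier phrase.

    A trailing incomplete phrase, if any, is dropped.
    """
    seen = set()
    phrases = []
    current = ""
    for symbol in seq:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    return phrases


if __name__ == "__main__":
    rng = random.Random(2000)
    n = 1000
    heights = [dst_height(n, rng) for _ in range(50)]
    print("k1 from the formula:", k1(n))
    print("distinct simulated heights:", sorted(set(heights)))
    print("phrases of 1011010100010:", lz78_phrases("1011010100010"))
```

Running the script prints the value of the formula next to the set of heights observed over 50 simulated trees, which gives a rough feel for how tightly the height concentrates. Any offset between the two should be read with the assumed height convention in mind.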