Experiments on the zero frequency problem

  • Authors:
  • J. G. Cleary; W. J. Teahan

  • Venue:
  • DCC '95 Proceedings of the Conference on Data Compression
  • Year:
  • 1995

Abstract

Summary form only given. A fundamental problem in the construction of statistical techniques for data compression of sequential text is the generation of probabilities from counts of previous occurrences. Each context used in the statistical model accumulates counts of the number of times each symbol has occurred in that context. So in a binary alphabet there will be two counts, $C_0$ and $C_1$ (the number of times a 0 or a 1 has occurred). The problem then is to take the counts and generate from them a probability that the next character will be a 0 or a 1. A naive estimate of the probability of character $i$ could be obtained from the ratio $p_i = C_i / (C_0 + C_1)$. The difficulty with this estimate is that it generates a zero probability if $C_0$ or $C_1$ is zero. Unfortunately, a zero probability prevents coding from working correctly, as the "optimum" code length in this case is infinite. Consequently, any estimate of the probabilities must be non-zero even in the presence of zero counts. This problem is called the zero frequency problem. A well-known solution to the problem was formulated by Laplace and is known as Laplace's law of succession. We have investigated the correctness of Laplace's law by experiment.
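
The abstract does not state the formula for Laplace's law of succession. In its standard form for a binary alphabet, it adds one to each count, giving $p_i = (C_i + 1) / (C_0 + C_1 + 2)$. The following is a minimal sketch contrasting it with the naive ratio. The function names are hypothetical, chosen for illustration, and are not taken from the paper.

```python
import math


def naive_estimate(c0, c1):
    """Naive ratio p_i = C_i / (C_0 + C_1); yields 0 when a count is 0."""
    total = c0 + c1
    if total == 0:
        raise ValueError("no symbols observed in this context")
    return c0 / total, c1 / total


def laplace_estimate(c0, c1):
    """Standard Laplace law of succession: p_i = (C_i + 1) / (C_0 + C_1 + 2)."""
    total = c0 + c1 + 2
    return (c0 + 1) / total, (c1 + 1) / total


def code_length_bits(p):
    """Ideal code length -log2(p) in bits; infinite for a zero probability."""
    return math.inf if p == 0 else -math.log2(p)


# A context in which 0 has occurred three times and 1 has never occurred.
p0_naive, p1_naive = naive_estimate(3, 0)
p0_laplace, p1_laplace = laplace_estimate(3, 0)
print(code_length_bits(p1_naive))    # infinite: a 1 cannot be coded
print(code_length_bits(p1_laplace))  # finite: a 1 can still be coded
```

With counts $C_0 = 3$ and $C_1 = 0$, the naive estimate assigns probability 0 to the symbol 1, whereas the Laplace estimate assigns it 1/5, so the code length remains finite.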