A convergent gambling estimate of the entropy of English

  • Authors:
  • T. Cover; R. King

  • Venue:
  • IEEE Transactions on Information Theory
  • Year:
  • 1978

Abstract

In his original paper on the subject, Shannon found upper and lower bounds for the entropy of printed English based on the number of trials required for a subject to guess subsequent symbols in a given text. The guessing approach precludes asymptotic consistency of either the upper or lower bounds except for degenerate ergodic processes. Shannon's technique of guessing the next symbol is altered by having the subject place sequential bets on the next symbol of text. If $S_n$ denotes the subject's capital after $n$ bets at 27-for-1 odds, and if it is assumed that the subject knows the underlying probability distribution for the process $X$, then the entropy estimate is $\hat{H}_n(X) = \left(1 - \tfrac{1}{n}\log_{27} S_n\right)\log_2 27$ bits/symbol. If the subject does not know the true probability distribution for the stochastic process, then $\hat{H}_n(X)$ is an asymptotic upper bound for the true entropy. If $X$ is stationary, $E\hat{H}_n(X) \rightarrow H(X)$, $H(X)$ being the true entropy of the process. Moreover, if $X$ is ergodic, then by the Shannon-McMillan-Breiman theorem $\hat{H}_n(X) \rightarrow H(X)$ with probability one. Preliminary indications are that English text has an entropy of approximately 1.3 bits/symbol, which agrees well with Shannon's estimate.
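
The estimator can be checked with a short simulation. The sketch below is illustrative and not from the paper: an i.i.d. 27-symbol source stands in for English text, the gambler bets proportionally to the known distribution, and the alphabet size, distribution, and sample size are assumed values. Under these assumptions the gambling estimate should approach the true entropy of the source.

```python
import numpy as np

# Minimal sketch (not from the paper): compute the gambling entropy estimate
# hat_H_n = (1 - (1/n) log_27 S_n) * log2(27) for an i.i.d. source over a
# 27-symbol alphabet, with a gambler who knows the true distribution and
# bets proportionally to it.

rng = np.random.default_rng(0)

A = 27                                # alphabet size (26 letters + space)
p = rng.dirichlet(np.ones(A))         # hypothetical "true" symbol distribution
n = 100_000                           # number of sequential bets
x = rng.choice(A, size=n, p=p)        # observed symbol sequence (i.i.d. here)

bets = p                              # proportional betting: b(x) = p(x)

# Capital after n bets at 27-for-1 odds: S_n = prod_i 27 * b(x_i).
# Accumulate log_27 S_n to avoid overflow.
log27_S_n = np.sum(1.0 + np.log(bets[x]) / np.log(A))

H_hat = (1.0 - log27_S_n / n) * np.log2(A)   # gambling entropy estimate
H_true = -np.sum(p * np.log2(p))             # true entropy of the source

print(f"estimate  {H_hat:.4f} bits/symbol")
print(f"true      {H_true:.4f} bits/symbol")
```

For proportional betting, the estimate reduces to $-\tfrac{1}{n}\sum_i \log_2 b(x_i)$, so the printed estimate converges to the true entropy as $n$ grows; with human subjects betting on real text, it instead yields a convergent upper bound when the subject's bets differ from the true conditional probabilities.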