An estimate of an upper bound for the entropy of English

  • Authors:
  • Peter F. Brown; Vincent J. Della Pietra; Robert L. Mercer; Stephen A. Della Pietra; Jennifer C. Lai

  • Affiliations:
  • IBM T. J. Watson Research Center (all authors)

  • Venue:
  • Computational Linguistics
  • Year:
  • 1992

Abstract

We present an estimate of an upper bound of 1.75 bits for the entropy of characters in printed English, obtained by constructing a word trigram model and then computing the cross-entropy between this model and a balanced sample of English text. We suggest the well-known and widely available Brown Corpus of printed English as a standard against which to measure progress in language modeling and offer our bound as the first of what we hope will be a series of steadily decreasing bounds.
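The measurement described in the abstract can be illustrated with a minimal sketch. The cross-entropy of a model on a text sample, expressed in bits per character, is an upper bound on the entropy of the source, provided the model assigns a probability to every possible string. The code below is not the authors' model: it is a toy word trigram model with add-k smoothing, and every name in it (`train_trigram`, `bits_per_character`, `BOS`, `UNK`, the parameter `k`) is an assumption introduced for illustration. In particular, out-of-vocabulary words are mapped to a single `UNK` token without modeling their spelling, so the number it produces is not a true upper bound.

```python
import math
from collections import Counter

BOS = "<s>"    # hypothetical sentence-start padding token
UNK = "<unk>"  # hypothetical token standing in for unseen words


def train_trigram(tokens, k=1.0):
    """Build an add-k smoothed word trigram model from a list of tokens."""
    vocab = set(tokens) | {UNK}
    padded = [BOS, BOS] + list(tokens)
    trigrams = list(zip(padded, padded[1:], padded[2:]))
    tri_counts = Counter(trigrams)
    ctx_counts = Counter((a, b) for a, b, _ in trigrams)
    vocab_size = len(vocab)

    def prob(word, context):
        """P(word | context) with add-k smoothing; sums to 1 over vocab."""
        word = word if word in vocab else UNK
        numerator = tri_counts[(context[0], context[1], word)] + k
        denominator = ctx_counts[context] + k * vocab_size
        return numerator / denominator

    return prob, vocab


def bits_per_character(prob, vocab, test_tokens):
    """Cross-entropy of the model on test_tokens, in bits per character."""
    history = (BOS, BOS)
    total_bits = 0.0
    for word in test_tokens:
        mapped = word if word in vocab else UNK
        total_bits -= math.log2(prob(mapped, history))
        history = (history[1], mapped)
    # Count each word's letters plus one separating space.
    n_chars = sum(len(word) + 1 for word in test_tokens)
    return total_bits / n_chars


train = "the cat sat on the mat and the dog sat on the rug".split()
test = "the dog sat on the mat".split()

prob, vocab = train_trigram(train)
bpc = bits_per_character(prob, vocab, test)
print(f"{bpc:.3f} bits per character")
```

Dividing the total code length by the number of characters, rather than the number of words, is what makes the figure comparable to a per-character bound such as the 1.75 bits reported in the abstract.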