An estimate of an upper bound for the entropy of English

Authors:
Peter F. Brown;Vincent J. Della Pietra;Robert L. Mercer;Stephen A. Della Pietra;Jennifer C. Lai
Affiliations:
IBM T. J. Watson Research Center;IBM T. J. Watson Research Center;IBM T. J. Watson Research Center;IBM T. J. Watson Research Center;IBM T. J. Watson Research Center
Venue:
Computational Linguistics
Year:
1992

Abstract

We present an estimate of an upper bound of 1.75 bits for the entropy of characters in printed English, obtained by constructing a word trigram model and then computing the cross-entropy between this model and a balanced sample of English text. We suggest the well-known and widely available Brown Corpus of printed English as a standard against which to measure progress in language modeling and offer our bound as the first of what we hope will be a series of steadily decreasing bounds.
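
The measurement described in the abstract can be illustrated with a small sketch: fit a word trigram model on training text, sum the negative log2 probability the model assigns to each word of a held-out sample, and divide by the number of characters in that sample to get bits per character. The code below is only an illustration under stated assumptions and not the authors' model. The function names, the add-one smoothing, the `<s>` padding, the `<unk>` token, and the convention of counting one separator character per word are all choices made here. Because unknown words are collapsed into `<unk>` rather than spelled out, the figure it produces is not a valid upper bound on character entropy.

```python
import math
from collections import Counter

BOS = "<s>"    # hypothetical sentence-start padding symbol
UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words


def train_trigram(tokens):
    """Collect trigram counts, context (bigram) counts, and the vocabulary."""
    padded = [BOS, BOS] + list(tokens)
    trigrams = list(zip(padded, padded[1:], padded[2:]))
    tri_counts = Counter(trigrams)
    # Each context is counted once per trigram it starts, so counts stay consistent.
    ctx_counts = Counter((a, b) for a, b, _ in trigrams)
    vocab = set(tokens) | {UNK}
    return tri_counts, ctx_counts, vocab


def trigram_prob(model, a, b, c):
    """Add-one smoothed P(c | a, b). Assumption: not the paper's estimator."""
    tri_counts, ctx_counts, vocab = model
    return (tri_counts[(a, b, c)] + 1) / (ctx_counts[(a, b)] + len(vocab))


def cross_entropy_bits_per_char(model, test_tokens):
    """Total bits the model spends on the test words, divided by character count."""
    _, _, vocab = model
    mapped = [w if w in vocab else UNK for w in test_tokens]
    padded = [BOS, BOS] + mapped
    total_bits = 0.0
    for a, b, c in zip(padded, padded[1:], padded[2:]):
        total_bits -= math.log2(trigram_prob(model, a, b, c))
    # Count the letters of each original word plus one separator per word.
    n_chars = sum(len(w) + 1 for w in test_tokens)
    return total_bits / n_chars


if __name__ == "__main__":
    train = "the cat sat on the mat and the cat sat on the rug".split()
    test = "the cat sat on the mat".split()
    model = train_trigram(train)
    print(f"{cross_entropy_bits_per_char(model, test):.3f} bits per character")
```

Dividing by characters rather than words is what makes the result comparable to a per-character figure such as the 1.75 bits quoted above. A faithful estimate would also need a model that assigns probability to the spelling of unseen words, so that every character of the test sample is accounted for.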