The purpose of this paper is to show that the difference between the best machine models and human models is smaller than previous results might indicate. This follows from several observations: first, the original human experiments used only a 27-character English alphabet (letters plus space), whereas most computer experiments used the full 128-character ASCII set; second, using large amounts of priming text substantially improves PPM's performance; and third, the PPM algorithm can be modified to perform better on English text. The problem of estimating the entropy of English is discussed, and the importance of training text for PPM is demonstrated, along with the further improvement obtained by "adjusting" the alphabet used. Together these improvements bring machine performance down to 1.46 bits per character.
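As a rough illustration of the quantities discussed above, the sketch below estimates bits per character for text restricted to the 27-character alphabet, using a simple adaptive order-k context model with optional priming text. It is a toy stand-in rather than the PPM variant evaluated in the paper; the function names, the add-one smoothing, and the default order are assumptions made for illustration only.

```python
import math
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 27-character alphabet used in the human experiments


def normalize(text):
    """Map text onto the 27-character alphabet: lowercase letters plus space."""
    out = []
    for ch in text.lower():
        if ch in ALPHABET:
            out.append(ch)
        elif ch.isspace():
            out.append(" ")
    return "".join(out)


def bits_per_char(text, order=2, priming=""):
    """Estimate bits per character of `text` with an adaptive order-k context model.

    Each context keeps symbol counts; probabilities use add-one smoothing over the
    27-symbol alphabet (an assumption of this sketch, not the PPM escape mechanism).
    Counts from `priming` are accumulated first, mimicking the effect of priming text
    on an adaptive compressor.
    """
    counts = defaultdict(lambda: defaultdict(int))

    def prob(context, ch):
        c = counts[context]
        total = sum(c.values())
        return (c[ch] + 1) / (total + len(ALPHABET))

    priming = normalize(priming)
    text = normalize(text)
    if not text:
        return 0.0

    # Seed the model with the priming text before any coding cost is charged.
    for i, ch in enumerate(priming):
        counts[priming[max(0, i - order):i]][ch] += 1

    total_bits = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - order):i]
        total_bits -= math.log2(prob(context, ch))
        # Update counts after coding the symbol, as an adaptive coder would.
        counts[context][ch] += 1
    return total_bits / len(text)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 50
    print(f"{bits_per_char(sample, order=2):.2f} bpc (no priming)")
    print(f"{bits_per_char(sample, order=2, priming=sample):.2f} bpc (primed)")
```

Even this crude model shows the two effects the abstract highlights: restricting the alphabet lowers the per-symbol cost, and supplying priming text reduces the measured bits per character further, since the model no longer starts from empty counts.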