Text compression is of considerable theoretical and practical interest; it is, for example, becoming increasingly important for fitting a large database onto a single CD-ROM. Many of the compression techniques discussed in the literature are model based. We propose the notion of a formal grammar as a flexible model of text generation, one that encompasses most of the models offered before and, in principle, extends the possibility of compression to a much more general class of languages. Assuming a general model of text generation, we derive the well-known Shannon entropy formula, making possible a theory of information based on text representation rather than on communication. The ideas are shown to apply to a number of commonly used text models. Finally, we focus on a Markov model of text generation, suggest an information-theoretic measure of similarity between two probability distributions, and develop a clustering algorithm based on this measure. The algorithm clusters Markov states, allowing the compression algorithm to be based on a smaller number of probability distributions than would otherwise be required. A number of theoretical consequences of this approach are explored, and a detailed example is given.
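
For reference, the well-known Shannon entropy formula that the abstract says is rederived from a text-representation standpoint is the standard

    H(p_1, \dots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i

that is, the expected number of bits per symbol for a source that emits symbol i with probability p_i, and the lower bound that any lossless code for such a source must respect.

The state-clustering step can also be sketched concretely. The abstract does not name its similarity measure or its clustering procedure, so the sketch below is an illustrative assumption only: it uses the symmetric, information-theoretic Jensen-Shannon divergence between the next-symbol distributions of Markov states, and greedily merges the closest pair of states until a target number of distributions remains. All names here (js, cluster_states, the example states s0..s3) are hypothetical.

    # A minimal sketch of Markov-state clustering by an
    # information-theoretic similarity measure. The measure (Jensen-
    # Shannon divergence) and the greedy agglomerative merging are
    # stand-ins, not the paper's specified method.
    import math

    def kl(p, q):
        """Kullback-Leibler divergence D(p || q), in bits."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def js(p, q):
        """Jensen-Shannon divergence: symmetric and bounded, so it can
        serve as a similarity measure between two distributions."""
        m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        return (kl(p, m) + kl(q, m)) / 2

    def cluster_states(dists, k):
        """Greedily merge states whose next-symbol distributions are
        closest under js() until only k clusters remain.
        dists: dict mapping state name -> probability vector.
        Returns: dict mapping tuple of merged states -> distribution."""
        clusters = {(s,): list(p) for s, p in dists.items()}
        while len(clusters) > k:
            a, b = min(
                ((x, y) for x in clusters for y in clusters if x < y),
                key=lambda pair: js(clusters[pair[0]], clusters[pair[1]]),
            )
            # Average the two distributions, weighted by how many
            # original states each cluster already covers.
            wa, wb = len(a), len(b)
            merged = [(wa * pa + wb * pb) / (wa + wb)
                      for pa, pb in zip(clusters[a], clusters[b])]
            del clusters[a], clusters[b]
            clusters[a + b] = merged
        return clusters

    # Four states over a binary alphabet: s0/s1 and s2/s3 have
    # near-identical conditional distributions.
    dists = {
        "s0": [0.90, 0.10],
        "s1": [0.88, 0.12],
        "s2": [0.30, 0.70],
        "s3": [0.32, 0.68],
    }
    print(cluster_states(dists, 2))

Clustering s0 with s1 and s2 with s3 leaves the coder maintaining two probability tables instead of four, which is exactly the kind of saving the abstract describes: the compression algorithm is based on a smaller number of probability distributions than the raw Markov model would require.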