Text compression is of considerable theoretical and practical interest; it is, for example, becoming increasingly important for fitting a large database onto a single CD-ROM. Many of the compression techniques discussed in the literature are model based. We propose the notion of a formal grammar as a flexible model of text generation, one that encompasses most of the models offered before and, in principle, extends the possibility of compression to a much more general class of languages. Assuming a general model of text generation, we derive the well-known Shannon entropy formula, making possible a theory of information based on text representation rather than on communication. The ideas are shown to apply to a number of commonly used text models. Finally, we focus on a Markov model of text generation, suggest an information-theoretic measure of similarity between two probability distributions, and develop a clustering algorithm based on this measure. The algorithm clusters Markov states, allowing the compression algorithm to be based on a smaller number of probability distributions than would otherwise be required. A number of theoretical consequences of this approach are explored, and a detailed example is given.
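
For reference, the well-known Shannon entropy formula that the abstract says is rederived from a text-representation standpoint is the standard

    H(p_1, \dots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i

that is, the expected number of bits per symbol for a source that emits symbol i with probability p_i, and the lower bound that any lossless code for such a source must respect.

The state-clustering step can also be sketched concretely. The abstract does not name its similarity measure or its clustering procedure, so the sketch below is an illustrative assumption only: it uses the symmetric, information-theoretic Jensen-Shannon divergence between the next-symbol distributions of Markov states, and greedily merges the closest pair of states until a target number of distributions remains. All names here (js, cluster_states, the example states s0..s3) are hypothetical.

    # A minimal sketch of Markov-state clustering by an
    # information-theoretic similarity measure. The measure (Jensen-
    # Shannon divergence) and the greedy agglomerative merging are
    # stand-ins, not the paper's specified method.
    import math

    def kl(p, q):
        """Kullback-Leibler divergence D(p || q), in bits."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def js(p, q):
        """Jensen-Shannon divergence: symmetric and bounded, so it can
        serve as a similarity measure between two distributions."""
        m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        return (kl(p, m) + kl(q, m)) / 2

    def cluster_states(dists, k):
        """Greedily merge states whose next-symbol distributions are
        closest under js() until only k clusters remain.
        dists: dict mapping state name -> probability vector.
        Returns: dict mapping tuple of merged states -> distribution."""
        clusters = {(s,): list(p) for s, p in dists.items()}
        while len(clusters) > k:
            a, b = min(
                ((x, y) for x in clusters for y in clusters if x < y),
                key=lambda pair: js(clusters[pair[0]], clusters[pair[1]]),
            )
            # Average the two distributions, weighted by how many
            # original states each cluster already covers.
            wa, wb = len(a), len(b)
            merged = [(wa * pa + wb * pb) / (wa + wb)
                      for pa, pb in zip(clusters[a], clusters[b])]
            del clusters[a], clusters[b]
            clusters[a + b] = merged
        return clusters

    # Four states over a binary alphabet: s0/s1 and s2/s3 have
    # near-identical conditional distributions.
    dists = {
        "s0": [0.90, 0.10],
        "s1": [0.88, 0.12],
        "s2": [0.30, 0.70],
        "s3": [0.32, 0.68],
    }
    print(cluster_states(dists, 2))

Clustering s0 with s1 and s2 with s3 leaves the coder maintaining two probability tables instead of four, which is exactly the kind of saving the abstract describes: the compression algorithm is based on a smaller number of probability distributions than the raw Markov model would require.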