We present Prediction by Grammatical Match (PGM), a new general-purpose adaptive text compression framework that blends finite-context and general context-free models. A PGM compressor operates incrementally: it parses a prefix of the input text, generating a set of analyses; the analyses are scored by encoding cost, the cheapest is selected, and the result is sent through an order-k PPM encoder.

PGM's primary innovations include the use of a generalized PPM for both selection and coding; the simultaneous use of multiple context-free grammars; the use of lexical left-corner derivations (LLCDs); and an aggressive algorithm for constructing an LR(0)-parsable metalanguage for LLCDs. LLCDs are a hybrid of bottom-up and top-down descriptions that represent grammatical information implicitly with each lexeme. The constructed metalanguage extends this with explicit top-down steps that resolve local ambiguities in at most one strictly grammatical symbol.

These properties combine to deliver excellent compression. On a test corpus of about 1 MB of Scheme program text, PGM with a generic Scheme grammar required about 26% fewer bits than PPM to represent the entire corpus, with reductions on individual files reaching as high as 55%. PGM also enriches the time-compression-memory tradeoff, since a low-order PGM can achieve bits-per-character (bpc) rates comparable to those of high-order PPMs at considerable savings in space. PGM compression runs in expected linear time and space for many kinds of grammars; PGM decompression runs in guaranteed linear time and space.
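The core selection step described above — score each candidate analysis by its estimated encoding cost under an adaptive context model and keep the cheapest — can be sketched as follows. This is a simplified illustration, not the paper's implementation: the `OrderKModel` below is a toy stand-in for a generalized PPM (it uses plain Laplace smoothing rather than PPM's escape mechanism), and all names (`OrderKModel`, `sequence_cost`, `select_cheapest`) are hypothetical.

```python
import math
from collections import defaultdict

class OrderKModel:
    """Toy adaptive order-k context model (hypothetical stand-in for the
    paper's generalized PPM; simplified: Laplace smoothing, no escapes)."""
    def __init__(self, k, alphabet_size=256):
        self.k = k
        self.alphabet_size = alphabet_size
        # counts[context][symbol] = number of times symbol followed context
        self.counts = defaultdict(lambda: defaultdict(int))

    def cost_bits(self, context, symbol):
        # Estimate -log2 P(symbol | last k symbols), smoothing the
        # empirical distribution toward uniform so unseen symbols
        # still get nonzero probability.
        ctx = tuple(context[-self.k:])
        dist = self.counts[ctx]
        total = sum(dist.values())
        p = (dist[symbol] + 1) / (total + self.alphabet_size)
        return -math.log2(p)

    def update(self, context, symbol):
        self.counts[tuple(context[-self.k:])][symbol] += 1

def sequence_cost(model, symbols):
    """Total estimated encoding cost of a symbol sequence in bits,
    updating the model adaptively, as an incremental compressor would."""
    bits, context = 0.0, []
    for s in symbols:
        bits += model.cost_bits(context, s)
        model.update(context, s)
        context.append(s)
    return bits

def select_cheapest(analyses, k=2):
    """Score each candidate analysis (a symbol sequence) and return the
    one whose estimated encoding cost is lowest."""
    return min(analyses, key=lambda a: sequence_cost(OrderKModel(k), a))
```

Under this cost measure a highly regular sequence is cheaper than an irregular one of the same length, so `select_cheapest` prefers analyses whose symbol stream the context model predicts well, which is the intuition behind scoring parses by encoding cost.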