Approximation algorithms for grammar-based compression

Authors:
Eric Lehman;Abhi Shelat
Affiliations:
MIT Laboratory for Computer Science, Cambridge, MA;MIT Laboratory for Computer Science, Cambridge, MA
Venue:
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Year:
2002

Abstract

Several recently proposed data compression algorithms are based on the idea of representing a string by a context-free grammar. Most of these algorithms are known to be asymptotically optimal with respect to a stationary ergodic source and to achieve a low redundancy rate. However, such results do not reveal how effectively these algorithms exploit the grammar model itself; that is, are the compressed strings produced as small as possible? We address this issue by analyzing the approximation ratio of several algorithms, that is, the maximum, over all inputs, of the ratio between the size of the generated grammar and the size of the smallest possible grammar. On the negative side, we show that every polynomial-time grammar-compression algorithm has approximation ratio at least 8569/8568 unless P = NP. Moreover, achieving an approximation ratio of o(log n/log log n) would require progress on an algebraic problem in a well-studied area. We then give upper and lower bounds on the approximation ratios of the following four previously proposed grammar-based compression algorithms: SEQUENTIAL, BISECTION, GREEDY, and LZ78, each of which employs a distinct approach to compression. These results seem to indicate that there is much room to improve grammar-based compression algorithms.
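
To make "representing a string by a context-free grammar" concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a grammar is stored as a dictionary from nonterminal names to right-hand sides, and that grammar size means the total number of symbols on all right-hand sides. The function `bisection_grammar` illustrates a bisection-style construction (split at the largest power of two strictly less than the length, one rule per distinct substring); the exact details of BISECTION as analyzed in the paper may differ. All function names here are hypothetical.

```python
def expand(grammar, symbol):
    """Expand a symbol into the string it derives (terminals have no rule)."""
    if symbol not in grammar:
        return symbol
    return "".join(expand(grammar, s) for s in grammar[symbol])


def grammar_size(grammar):
    """Assumed size measure: total symbols across all right-hand sides."""
    return sum(len(rhs) for rhs in grammar.values())


def bisection_grammar(text):
    """Bisection-style sketch: recursive splitting, one rule per distinct substring."""
    if not text:
        raise ValueError("text must be non-empty")
    grammar = {}
    names = {}  # substring -> nonterminal name already assigned to it

    def build(s):
        if len(s) == 1:
            return s  # a single character is a terminal
        if s in names:
            return names[s]  # reuse the rule for a repeated substring
        split = 1
        while split * 2 < len(s):  # largest power of two strictly below len(s)
            split *= 2
        left, right = build(s[:split]), build(s[split:])
        name = f"<{len(names)}>"  # multi-character name, cannot clash with a terminal
        names[s] = name
        grammar[name] = [left, right]
        return name

    return grammar, build(text)


grammar, start = bisection_grammar("abababab")
print(grammar)                 # three rules: <0> -> a b, <1> -> <0> <0>, <2> -> <1> <1>
print(grammar_size(grammar))   # 6 symbols, versus 8 characters in the input
```

The approximation ratio discussed in the abstract compares a size such as `grammar_size(grammar)` against the size of the smallest grammar for the same string. The sketch does not attempt to compute that smallest grammar.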