The smallest grammar problem

Authors:
M. Charikar;E. Lehman;D. Liu;R. Panigrahy;M. Prabhakaran;A. Sahai;A. Shelat
Affiliations:
Dept. of Comput. Sci., Princeton Univ., NJ, USA;-;-;-;-;-;-
Venue:
IEEE Transactions on Information Theory
Year:
2005

Citing 0
Cited 39

The complexity of tree automata and XPath on grammar-compressed trees

Theoretical Computer Science - Implementation and application of automata
Efficient memory representation of XML document trees

Information Systems
Context-Sensitive Grammar Transform: Compression and Pattern Matching

SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
On the Value of Multiple Read/Write Streams for Data Compression

CPM '09 Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching
A bisection algorithm for grammar-based compression of ordered trees

Information Processing Letters
Fast and Compact Web Graph Representations

ACM Transactions on the Web (TWEB)
Improved approximation algorithms for minimum AND-circuits problem via k-set cover

Information Processing Letters
Leaf languages and string compression

Information and Computation
Compressed string dictionaries

SEA'11 Proceedings of the 10th international conference on Experimental algorithms
Lower bounds for context-free grammars

Information Processing Letters
Natural Language Compression on Edge-Guided text preprocessing

Information Sciences: an International Journal
Scalable detection of frequent substrings by grammar-based compression

DS'11 Proceedings of the 14th international conference on Discovery science
Fast q-gram mining on SLP compressed strings

SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Iterative Dictionary Construction for Compression of Large DNA Data Sets

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Approximability of minimum AND-Circuits

SWAT'06 Proceedings of the 10th Scandinavian conference on Algorithm Theory
Random access to grammar-compressed strings

Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
Querying and embedding compressed texts

MFCS'06 Proceedings of the 31st international conference on Mathematical Foundations of Computer Science
Searching for smallest grammars on large sequences and application to DNA

Journal of Discrete Algorithms
Choosing word occurrences for the smallest grammar problem

LATA'10 Proceedings of the 4th international conference on Language and Automata Theory and Applications
Grammar-based compression in a streaming model

LATA'10 Proceedings of the 4th international conference on Language and Automata Theory and Applications
A faster grammar-based self-index

LATA'12 Proceedings of the 6th international conference on Language and Automata Theory and Applications
Parameter reduction and automata evaluation for grammar-compressed trees

Journal of Computer and System Sciences
Self-Indexed Grammar-Based Compression

Fundamenta Informaticae
Algorithms and limits for compact plan representations

Journal of Artificial Intelligence Research
Improved grammar-based compressed indexes

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Faster algorithm for computing the edit distance between SLP-Compressed strings

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Grammar precompression speeds up Burrows-Wheeler compression

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Variable-Length codes for space-efficient grammar-based compression

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Fast q-gram mining on SLP compressed strings

Journal of Discrete Algorithms
ESP-index: A compressed index based on edit-sensitive parsing

Journal of Discrete Algorithms
An effective heuristic for the smallest grammar problem

Proceedings of the 15th annual conference on Genetic and evolutionary computation
Complexity of counting output patterns of logic circuits

CATS '13 Proceedings of the Nineteenth Computing: The Australasian Theory Symposium - Volume 141
Tree compression with top trees

ICALP'13 Proceedings of the 40th international conference on Automata, Languages, and Programming - Volume Part I
Fingerprints in compressed strings

WADS'13 Proceedings of the 13th international conference on Algorithms and Data Structures
XML tree structure compression using RePair

Information Systems
On the value of multiple read/write streams for data compression

Information Theory, Combinatorics, and Search Theory
Finding the smallest binarization of a CFG is NP-hard

Journal of Computer and System Sciences
A quadsection algorithm for grammar-based image compression

Integrated Computer-Aided Engineering - Anniversary Volume: Celebrating 20 Years of Excellence
Guest column: the elusive inapproximability of the TSP

ACM SIGACT News

Quantified Score

Hi-index	754.84

Abstract

This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields such as data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, the worst-case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results concern the hardness of approximating the smallest grammar problem. Most notably, we show that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P=NP. We then bound approximation ratios for several of the best-known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n^{1/2}). We finish by presenting two novel algorithms with exponentially better ratios of O(log^3 n) and O(log(n/m*)), where m* is the size of the smallest grammar for that input. The latter algorithm highlights a connection between grammar-based compression and LZ77.
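
To make the object in the abstract concrete, here is a minimal Python sketch of a grammar that generates exactly one string, together with a simplified pair-replacement heuristic in the spirit of RE-PAIR. It is an illustration only, not the paper's algorithm or analysis. The assumptions are: a grammar is a dict from integer nonterminals to lists of symbols, terminals are single-character strings, nonterminal `0` is the start symbol, and the size of a grammar is taken to be the total number of symbols on all right-hand sides. The function names `expand`, `grammar_size`, and `naive_repair` are hypothetical.

```python
from collections import Counter


def expand(grammar, symbol=0):
    """Return the unique string derived from `symbol`."""
    if isinstance(symbol, str):  # terminals are plain characters
        return symbol
    return "".join(expand(grammar, s) for s in grammar[symbol])


def grammar_size(grammar):
    """Assumed size measure: total symbols over all right-hand sides."""
    return sum(len(rhs) for rhs in grammar.values())


def naive_repair(text):
    """Repeatedly replace a most frequent adjacent pair by a new nonterminal.

    Simplification: pair counts include overlapping occurrences, so a rule
    may occasionally be used fewer times than its count suggests. The
    resulting grammar still derives `text` exactly.
    """
    seq = list(text)
    grammar = {}
    next_id = 1  # nonterminal 0 is reserved for the start symbol
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        grammar[next_id] = list(pair)
        # Scan left to right, replacing non-overlapping occurrences.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    grammar[0] = seq
    return grammar


if __name__ == "__main__":
    g = naive_repair("abababab")
    print(g, grammar_size(g), expand(g))
```

For the input `abababab`, this sketch produces three rules of two symbols each, so the grammar has size 6 against a string of length 8. The approximation ratio discussed in the abstract compares the size of the grammar an algorithm outputs with the size m* of the smallest grammar for the same input.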