Self-Indexed Grammar-Based Compression

Authors:
Francisco Claude; Gonzalo Navarro
Affiliations:
(Corresponding author; funded in part by NSERC Canada, the Go-Bell Scholarships program, and the David R. Cheriton Graduate Scholarships program.) David R. Cheriton School of Computer Science, University of Waterloo ...; (Funded in part by Millennium Institute on Cell Dynamics and Biotechnology (ICDB), Grant ICM P05-001-F, Mideplan, Chile.) Department of Computer Science, University of Chile, Chile. gnavarro@dcc.uc ...
Venue:
Fundamenta Informaticae
Year:
2011

Abstract

Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammar. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log_2 n bits. For that same SLP, our self-index takes O(n log n) + n log_2 u bits. It extracts any text substring of length m in time O((m + h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log_2 n (1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar compressors that reduce T to a sequence of terminals and nonterminals, such as Re-Pair and LZ78.
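To make the SLP terminology concrete, the following Python sketch stores an SLP as a plain array of rules and extracts a substring by descending the parse tree. It is only an illustration under stated assumptions, not the self-index described in the abstract: the rule encoding, the function names `expansion_lengths` and `extract`, and the toy grammar are all hypothetical, and the lengths are kept in an uncompressed list. The descent visits roughly O(m + h) parse-tree nodes for a substring of length m in a tree of height h, which is the shape behind the extraction bound quoted above (without the log n factor that the compact representation adds).

```python
from typing import List, Tuple, Union

# Hypothetical encoding: rule i is either a single terminal character
# or a pair (left, right) of indices of earlier rules.
Rule = Union[str, Tuple[int, int]]


def expansion_lengths(rules: List[Rule]) -> List[int]:
    """Length of the string each rule expands to, computed bottom-up."""
    lengths: List[int] = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def extract(rules: List[Rule], lengths: List[int],
            root: int, start: int, length: int) -> str:
    """Return T[start : start + length] (0-based) without expanding all of T."""
    out: List[str] = []

    def visit(sym: int, lo: int, hi: int) -> None:
        # [lo, hi) is the requested range inside the expansion of sym.
        if lo >= hi:
            return
        rule = rules[sym]
        if isinstance(rule, str):
            out.append(rule)
            return
        left, right = rule
        mid = lengths[left]
        if lo < mid:   # part of the range falls in the left child
            visit(left, lo, min(hi, mid))
        if hi > mid:   # part of the range falls in the right child
            visit(right, max(lo, mid) - mid, hi - mid)

    visit(root, max(start, 0), min(start + length, lengths[root]))
    return "".join(out)


# Toy grammar (an assumption for illustration): rule 5 expands to "abaababa".
rules: List[Rule] = ["a", "b", (0, 1), (2, 0), (3, 2), (4, 3)]
lengths = expansion_lengths(rules)

print(extract(rules, lengths, root=5, start=0, length=8))  # abaababa
print(extract(rules, lengths, root=5, start=2, length=4))  # aaba
```

The recursion depth is bounded by the parse-tree height h, so an iterative descent would be preferable for very deep grammars.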