Improved grammar-based compressed indexes

Authors:
Francisco Claude; Gonzalo Navarro
Affiliations:
David R. Cheriton School of Computer Science, University of Waterloo, Canada; Department of Computer Science, University of Chile, Chile
Venue:
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Year:
2012




Abstract

We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] represented by a (context-free) grammar of n (terminal and nonterminal) symbols and size N (measured as the sum of the lengths of the right-hand sides of the rules), a basic grammar-based representation of T takes $N\lg n$ bits of space. Our representation requires $2N\lg n + N\lg u + \epsilon\, n\lg n + o(N\lg n)$ bits of space, for any $0 < \epsilon \le 1$. It can find the positions of the occ occurrences of a pattern of length m in T in $O\left((m^2/\epsilon)\lg \left(\frac{\lg u}{\lg n}\right) + (m+occ)\lg n\right)$ time, and extract any substring of length ℓ of T in time $O(\ell+h\lg(N/h))$, where h is the height of the grammar tree.
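To make the quantities in the abstract concrete, the following Python sketch builds a toy grammar and computes n, N, u and h for it, evaluates the leading terms of the two space bounds, and extracts a substring by descending the grammar. Everything here is an assumption made for illustration: the grammar, the names `RULES`, `length`, `height`, `extract`, `grammar_stats`, `basic_bits` and `index_bits`, and the 0-based offsets (the abstract indexes T from 1). The extraction is a naive top-down descent, not the paper's data structure, and it does not achieve the $O(\ell+h\lg(N/h))$ bound.

```python
import math
from functools import lru_cache

# Hypothetical toy grammar generating T = "abababab" (illustration only).
RULES = {
    "X": ["a", "b"],  # X -> a b
    "Y": ["X", "X"],  # Y -> X X
    "S": ["Y", "Y"],  # S -> Y Y
}
START = "S"


@lru_cache(maxsize=None)
def length(sym):
    """Length of the string that sym expands to (1 for a terminal)."""
    if sym not in RULES:
        return 1
    return sum(length(s) for s in RULES[sym])


@lru_cache(maxsize=None)
def height(sym):
    """Height of the parse tree rooted at sym (0 for a terminal)."""
    if sym not in RULES:
        return 0
    return 1 + max(height(s) for s in RULES[sym])


def extract(sym, start, count):
    """Return `count` characters of sym's expansion from 0-based offset `start`.

    Naive descent that skips children lying entirely before the range.
    """
    if count <= 0:
        return ""
    if sym not in RULES:
        return sym
    out = []
    for child in RULES[sym]:
        if count == 0:
            break
        clen = length(child)
        if start >= clen:  # requested range begins after this child
            start -= clen
            continue
        take = min(count, clen - start)
        out.append(extract(child, start, take))
        count -= take
        start = 0
    return "".join(out)


def grammar_stats():
    """Return (n, N, u, h) as defined in the abstract."""
    terminals = {s for rhs in RULES.values() for s in rhs if s not in RULES}
    n = len(RULES) + len(terminals)             # terminal and nonterminal symbols
    N = sum(len(rhs) for rhs in RULES.values())  # sum of right-hand side lengths
    return n, N, length(START), height(START)


def basic_bits(n, N):
    """N lg n: space of the basic grammar-based representation."""
    return N * math.log2(n)


def index_bits(n, N, u, eps):
    """2N lg n + N lg u + eps n lg n, omitting the o(N lg n) term."""
    return 2 * N * math.log2(n) + N * math.log2(u) + eps * n * math.log2(n)


n, N, u, h = grammar_stats()
print(n, N, u, h)
print(extract(START, 2, 5))
print(basic_bits(n, N), index_bits(n, N, u, eps=1.0))
```

On an input this small the bit counts are not meaningful as compression figures; the sketch only shows how each symbol in the bounds maps onto a property of the grammar.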