Self-indexing based on LZ77

Authors:
Sebastian Kreft; Gonzalo Navarro
Affiliations:
Dept. of Computer Science, University of Chile, Santiago, Chile; Dept. of Computer Science, University of Chile, Santiago, Chile
Venue:
CPM'11 Proceedings of the 22nd annual conference on Combinatorial pattern matching
Year:
2011

Abstract

We introduce the first self-index based on the Lempel-Ziv 1977 (LZ77) compression format. It is particularly competitive for highly repetitive text collections, such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible, but classical self-indexes fail to capture that source of compressibility. In practice, our self-index takes a few times the space of the text compressed with LZ77 (as little as 2.5 times), extracts 1-2 million characters of the text per second, and finds patterns at a rate of 10-50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.
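
To illustrate why highly repetitive collections are so compressible under LZ77, the following is a minimal sketch of a naive greedy LZ77 factorization. It is not the self-index described in the abstract, nor the authors' implementation: the function names `lz77_parse` and `lz77_decode`, the triple layout `(source_start, length, next_char)`, and the toy inputs are assumptions made for illustration only. The sketch compares the number of phrases produced for a repeated string against a random string of the same length over the same alphabet.

```python
import random


def lz77_parse(text):
    """Naive greedy LZ77 factorization (illustrative, cubic time in the worst case).

    Each phrase is a triple (source_start, length, next_char): copy `length`
    characters starting at an earlier position, then append one explicit character.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, 0
        for src in range(i):
            length = 0
            # Stop one character early so an explicit trailing character remains.
            # The source may overlap the phrase being built (self-reference).
            while i + length < n - 1 and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrase triples, copying one character at a time."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


# Hypothetical toy inputs: a highly repetitive string and a random one.
repetitive = "ACGTTGCA" * 50
rng = random.Random(0)
shuffled = "".join(rng.choices("ACGT", k=len(repetitive)))

print("phrases (repetitive):", len(lz77_parse(repetitive)))
print("phrases (random):    ", len(lz77_parse(shuffled)))
```

The repetitive input collapses into a handful of phrases because one long copy covers almost the whole string, while the random input needs many short phrases. The number of phrases in the parse is the quantity an LZ77-based index would aim to keep its space proportional to.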