Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices

Authors:
Diego Arroyuelo; Gonzalo Navarro
Affiliations:
Yahoo! Research Chile, Santiago, Chile; University of Chile, Santiago, Chile
Venue:
Journal of Experimental Algorithmics (JEA)
Year:
2010

Abstract

Given a text T[1..n] over an alphabet of size σ, the full-text search problem consists in locating the occ occurrences of a given pattern P[1..m] in T. Compressed full-text self-indices are space-efficient representations of the text that provide direct access to and indexed search on it. The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 5 times the size of the compressed text (in theory, 4nH_k(T) + o(n log σ) bits of space, where H_k(T) is the k-th order empirical entropy of T). In practice, the average locating complexity of the LZ-index is O(σ m log_σ n + occ σ^{m/2}), where occ is the number of occurrences of P. It can extract text substrings of length ℓ in O(ℓ) time. This index outperforms competing schemes both to locate short patterns and to extract text snippets. However, the LZ-index can be up to 4 times larger than the smallest existing indices (which use nH_k(T) + o(n log σ) bits in theory), and it does not offer space/time tuning options. This limits its applicability. In this article, we study practical ways to reduce the space of the LZ-index. We obtain new LZ-index variants that require 2(1+ε)nH_k(T) + o(n log σ) bits of space, for any 0 < ε < 1. They have an average locating time of O(1/ε (m log n + occ σ^{m/2})), while extracting takes O(ℓ) time. We perform extensive experimentation and conclude that our schemes are able to reduce the space of the original LZ-index by a factor of 2/3, that is, around 3 times the compressed text size. Our schemes are able to extract about 1 to 2 MB of the text per second, being twice as fast as the most competitive alternatives. Pattern occurrences are located at a rate of up to 1 to 4 million per second. This constitutes the best space/time trade-off when indices are allowed to use 4 times the size of the compressed text or more.
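
The abstract states that the LZ-index is built on the LZ78 compression algorithm. As a rough illustration of that underlying parsing only (not of the LZ-index itself, its succinct data structures, or the article's space-reduced variants), the following minimal sketch splits a text into LZ78 phrases. The function names lz78_parse and lz78_decode and the dictionary-based trie are assumptions made for this example.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases.

    Each phrase is (parent_id, char): the longest previously seen phrase
    (id 0 is the empty phrase) extended by one new character.
    """
    trie = {}      # (parent_id, char) -> id of the phrase that extends parent by char
    phrases = []   # phrase with id i is stored at phrases[i - 1]
    node = 0       # current position in the trie; 0 is the root (empty phrase)
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                    # keep following an existing phrase
        else:
            phrases.append((node, ch))      # new phrase = known phrase + ch
            trie[(node, ch)] = len(phrases)
            node = 0                        # restart from the root
    if node != 0:
        # The text ended inside an already known phrase; record it with no
        # extra character (an assumption of this sketch, other conventions
        # append a unique terminator to the text instead).
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the text by expanding each phrase from its parent phrase."""
    expanded = [""]
    for parent, ch in phrases:
        expanded.append(expanded[parent] + ch)
    return "".join(expanded[1:])
```

For instance, lz78_parse("abababab") yields the phrases a, b, ab, aba, b. The number of phrases grows more slowly than the text length on repetitive inputs, which is what allows indices built over the phrase trie to occupy space related to the compressed text size.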