A Lempel-Ziv text index on secondary storage

Authors:
Diego Arroyuelo; Gonzalo Navarro
Affiliations:
Dept. of Computer Science, Universidad de Chile, Santiago, Chile; Dept. of Computer Science, Universidad de Chile, Santiago, Chile
Venue:
CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching
Year:
2007

References (21)

Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing
Hierarchies of indices for text searching. Information Systems
The string B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM (JACM)
Fast string searching in secondary storage: theoretical developments and experimental results. Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms
Efficient suffix trees on secondary storage. Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms
PATRICIA: Practical Algorithm To Retrieve Information Coded in Alphanumeric. Journal of the ACM (JACM)
Reducing the space requirement of suffix trees. Software: Practice & Experience
Compression of Low Entropy Strings with Lempel-Ziv Algorithms. SIAM Journal on Computing
Fast and flexible word searching on compressed text. ACM Transactions on Information Systems (TOIS)
An analysis of the Burrows-Wheeler transform. Journal of the ACM (JACM)
Succinct representations of lcp information and improvements in the compressed suffix arrays. SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Succinct Representation of Balanced Parentheses and Static Trees. SIAM Journal on Computing
Tables. Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science
Indexing text using the Ziv-Lempel trie. Journal of Discrete Algorithms - SPIRE 2002
Indexing compressed text. Journal of the ACM (JACM)
Compressed full-text indexes. ACM Computing Surveys (CSUR)
Compressed Text Indexes with Fast Locate. CPM '07 Proceedings of the 18th annual symposium on Combinatorial Pattern Matching
Reducing the space requirement of LZ-Index. CPM'06 Proceedings of the 17th Annual conference on Combinatorial Pattern Matching
Space-efficient construction of LZ-index. ISAAC'05 Proceedings of the 16th international conference on Algorithms and Computation
Advantages of backward searching - efficient secondary memory and distributed implementation of compressed suffix arrays. ISAAC'04 Proceedings of the 15th international conference on Algorithms and Computation
Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory

Cited by (7)

Algorithms and data structures for external memory. Foundations and Trends® in Theoretical Computer Science
Implementing the LZ-index: Theory versus practice. Journal of Experimental Algorithmics (JEA)
An Improved Succinct Representation for Dynamic k-ary Trees. CPM '08 Proceedings of the 19th annual symposium on Combinatorial Pattern Matching
On Entropy-Compressed Text Indexing in External Memory. SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Compression, indexing, and retrieval for massive string data. CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
Data structures: time, I/Os, entropy, joules! ESA'10 Proceedings of the 18th annual European conference on Algorithms: Part II
Practical approaches to reduce the space requirement of Lempel-Ziv-based compressed text indices. Journal of Experimental Algorithmics (JEA)

Abstract

Full-text searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uH_k + o(u log σ) bits of space, where H_k denotes the k-th order empirical entropy of T, for any k = o(log_σ u). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4-2.3 times the text size including the text, which is 39%-65% of the size of traditional indexes like String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk accesses to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04-1.68 times the text size, requiring about 20-60 disk accesses, depending on the pattern length.
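
The space bound above is stated in terms of the k-th order empirical entropy H_k. As a minimal sketch, assuming the standard definition (H_k is the average, over all length-k contexts, of the zero-order entropy of the characters that follow each context), the Python below computes H_k for a small string and the leading term 8uH_k of the bound. The function names are illustrative and do not come from the paper, and the lower-order o(u log σ) term is deliberately left out because it has no closed form here.

```python
import math
from collections import Counter, defaultdict


def h0(symbols):
    """Zero-order empirical entropy, in bits per symbol, of a sequence."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def hk(text, k):
    """k-th order empirical entropy, in bits per symbol, of text.

    Assumes the standard definition: group every character by the k
    characters that precede it, then average the zero-order entropies
    of those groups, weighted by group size and normalized by len(text).
    """
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)


def leading_term_bits(text, k):
    """Leading term 8*u*H_k of the space bound (lower-order term ignored)."""
    return 8 * len(text) * hk(text, k)
```

For the string "abababab", `hk(text, 0)` is 1 bit per symbol, while `hk(text, 1)` is 0 because every character is fully determined by the one before it. This is why a bound in terms of H_k can be far below the u log σ bits of the plain text on repetitive inputs.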