Given a text T[1..n] over an alphabet of size σ, the full-text search problem consists of locating the occ occurrences of a given pattern P[1..m] in T. Compressed full-text self-indices are space-efficient representations of the text that provide both direct access to it and indexed search on it. The LZ-index of Navarro is a compressed full-text self-index based on the LZ78 compression algorithm. This index requires about 5 times the size of the compressed text (in theory, 4nH_k(T) + o(n log σ) bits of space, where H_k(T) is the k-th order empirical entropy of T). In practice, the average locating complexity of the LZ-index is O(σ m log_σ n + occ σ^(m/2)), where occ is the number of occurrences of P. It can extract text substrings of length ℓ in O(ℓ) time. This index outperforms competing schemes both for locating short patterns and for extracting text snippets. However, the LZ-index can be up to 4 times larger than the smallest existing indices (which use nH_k(T) + o(n log σ) bits in theory), and it does not offer space/time tuning options. This limits its applicability. In this article, we study practical ways to reduce the space of the LZ-index. We obtain new LZ-index variants that require 2(1+ε)nH_k(T) + o(n log σ) bits of space, for any 0 < ε < 1, and have an average locating time of O(1/ε (m log n + occ σ^(m/2))), while extracting takes O(ℓ) time. We perform extensive experimentation and conclude that our schemes are able to reduce the space of the original LZ-index to about 2/3 of its size, that is, to around 3 times the compressed text size. Our schemes are able to extract about 1 to 2 MB of text per second, twice as fast as the most competitive alternatives. Pattern occurrences are located at a rate of up to 1 to 4 million per second. This constitutes the best space/time trade-off when indices are allowed to use 4 times the size of the compressed text or more.
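
To make the LZ78 basis of the index concrete, the following is a minimal sketch of LZ78 parsing, the step that cuts T into phrases, each being a previously seen phrase extended by one character. It is an illustration only, not the authors' implementation: the function names lz78_parse and lz78_decode are hypothetical, and the succinct trie encodings that give the LZ-index its space bounds are not shown.

```python
def lz78_parse(text):
    """Cut text into LZ78 phrases.

    Each phrase is returned as (parent_id, char): the phrase equals the
    phrase numbered parent_id (0 is the empty phrase) followed by char.
    Phrase ids are assigned 1, 2, 3, ... in order of creation.
    """
    trie = {}      # (parent_id, char) -> phrase id; a plain dict stands in for the LZ78 trie
    phrases = []
    node = 0       # current trie node; 0 is the root (empty phrase)
    for ch in text:
        key = (node, ch)
        if key in trie:
            # Still inside a known phrase: descend in the trie.
            node = trie[key]
        else:
            # Longest known prefix plus one new character: a new phrase.
            phrases.append(key)
            trie[key] = len(phrases)
            node = 0
    if node != 0:
        # The text ended inside a known phrase; emit it with an empty extension.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the text from the (parent_id, char) pairs of lz78_parse."""
    strings = [""]             # strings[i] is the text of phrase i
    for parent, ch in phrases:
        strings.append(strings[parent] + ch)
    return "".join(strings[1:])


if __name__ == "__main__":
    parsed = lz78_parse("abababab")
    print(parsed)
    print(lz78_decode(parsed))
```

The phrases form a trie because every phrase extends an earlier one by a single character. The LZ-index stores this trie, together with auxiliary structures, in compressed form, which is what ties its size to that of the LZ78-compressed text.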