On Entropy-Compressed Text Indexing in External Memory

Authors:
Wing-Kai Hon;Rahul Shah;Sharma V. Thankachan;Jeffrey Scott Vitter
Affiliations:
Department of Computer Science, National Tsing Hua University, Taiwan;Department of Computer Science, Louisiana State University, USA;Department of Computer Science, Louisiana State University, USA;Department of Computer Science, Texas A&M University, USA
Venue:
SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Year:
2009

References (17):
The input/output complexity of sorting and related problems. Communications of the ACM.
Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing.
The string B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM (JACM).
A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM (JACM).
High-order entropy-compressed text indexes. SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms.
Sparse Suffix Trees. COCOON '96 Proceedings of the Second Annual International Conference on Computing and Combinatorics.
New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms.
Indexing compressed text. Journal of the ACM (JACM).
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing.
Compressed full-text indexes. ACM Computing Surveys (CSUR).
Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms (TALG).
Compressed Suffix Trees with Full Functionality. Theory of Computing Systems.
Geometric Burrows-Wheeler Transform: Linking Range Searching and Text Indexing. DCC '08 Proceedings of the Data Compression Conference.
Compressed Index for Dictionary Matching. DCC '08 Proceedings of the Data Compression Conference.
Linear pattern matching algorithms. SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory.
Efficient Data Structures for the Orthogonal Range Successor Problem. COCOON '09 Proceedings of the 15th Annual International Conference on Computing and Combinatorics.
A Lempel-Ziv text index on secondary storage. CPM '07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching.

Cited by (7):
Compression, indexing, and retrieval for massive string data. CPM '10 Proceedings of the 21st annual conference on Combinatorial Pattern Matching.
Data structures: time, I/Os, entropy, joules! ESA '10 Proceedings of the 18th annual European conference on Algorithms: Part II.
Fully compressed suffix trees. ACM Transactions on Algorithms (TALG).
Compressed text indexing with wildcards. SPIRE '11 Proceedings of the 18th international conference on String processing and information retrieval.
A faster grammar-based self-index. LATA '12 Proceedings of the 6th international conference on Language and Automata Theory and Applications.
Computing Lempel-Ziv factorization online. MFCS '12 Proceedings of the 37th international conference on Mathematical Foundations of Computer Science.
Compressed text indexing with wildcards. Journal of Discrete Algorithms.

Abstract

A new trend in the field of pattern matching is to design indexing data structures whose space is very close to that of the indexed text in entropy-compressed form, while still achieving good query performance. Two popular indexes, the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter, 2005], achieve this goal by exploiting the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994]. However, due to the intricate permutation structure of the BWT, no locality of reference can be guaranteed when pattern matching is performed with these indexes. Chien et al. [2008] gave an alternative text index based on sparsifying the traditional suffix tree and maintaining an auxiliary 2-D range query structure. Given a text T of length n drawn from a σ-sized alphabet, they achieved an O(n log σ)-bit index for T and showed that this index preserves locality in pattern matching and is hence amenable to external-memory settings. We improve upon this index and show how to apply entropy compression to reduce the index space. Our index takes O(n(H_k + 1)) + o(n log σ) bits of space, where H_k is the kth-order empirical entropy of the text. This is achieved by creating variable-length blocks of text using arithmetic coding.
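
The kth-order empirical entropy in the space bound above can be made concrete with a small calculation. The following is a minimal sketch, assuming the standard definition in which each length-k context contributes the zeroth-order entropy of the symbols that follow it, weighted by how often that context occurs. The function names `h0` and `hk` are illustrative and do not come from the paper, and the code only evaluates the entropy measure; it is not the index construction itself.

```python
import math
from collections import Counter, defaultdict


def h0(symbols):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    # Each distinct symbol with count c contributes (c/n) * log2(n/c).
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol.

    Assumed definition: group every symbol by the k symbols preceding it,
    then take the length-weighted average of h0 over those groups.
    """
    n = len(text)
    if n == 0:
        return 0.0
    if k == 0:
        return h0(text)
    followers = defaultdict(list)
    for i in range(k, n):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / n


if __name__ == "__main__":
    sample = "mississippi"
    for order in range(3):
        print(order, hk(sample, order))
```

For a highly repetitive text such as "abab", `h0` is 1 bit per symbol while `hk` with k = 1 is 0, since every symbol is fully determined by its predecessor. This gap is what an H_k-based space bound exploits relative to an n log σ bound.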