On Entropy-Compressed Text Indexing in External Memory

  • Authors:
  • Wing-Kai Hon;Rahul Shah;Sharma V. Thankachan;Jeffrey Scott Vitter

  • Affiliations:
  • Department of Computer Science, National Tsing Hua University, Taiwan;Department of Computer Science, Louisiana State University, USA;Department of Computer Science, Louisiana State University, USA;Department of Computer Science, Texas A&M University, USA

  • Venue:
  • SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
  • Year:
  • 2009

Abstract

A new trend in the field of pattern matching is to design indexing data structures that take space very close to that required by the indexed text (in entropy-compressed form) while simultaneously achieving good query performance. Two popular indexes, namely the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter, 2005], achieve this goal by exploiting the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994]. However, due to the intricate permutation structure of the BWT, no locality of reference can be guaranteed when we perform pattern matching with these indexes. Chien et al. [2008] gave an alternative text index based on sparsifying the traditional suffix tree and maintaining an auxiliary 2-D range query structure. Given a text T of length n drawn from a σ-sized alphabet set, they achieved an O(n log σ)-bit index for T and showed that this index can preserve locality in pattern matching and hence is well suited to external-memory settings. We improve upon this index and show how to apply entropy compression to reduce index space. Our index takes O(n(H_k + 1)) + o(n log σ) bits of space, where H_k is the kth-order empirical entropy of the text. This is achieved by creating variable-length blocks of text using arithmetic coding.
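
To make the space bound above more concrete, the kth-order empirical entropy H_k can be computed directly from its standard definition: group every character of T by the k characters that precede it, take the zeroth-order entropy of each group, and average the results weighted by group size. The following is a minimal Python sketch of that definition only, not of the index described in the abstract; the function names `h0` and `empirical_entropy` are illustrative assumptions and do not come from the paper.

```python
from collections import Counter, defaultdict
from math import log2


def h0(counts):
    """Zeroth-order empirical entropy (bits per symbol) of a symbol histogram."""
    total = sum(counts.values())
    return sum(c / total * log2(total / c) for c in counts.values())


def empirical_entropy(text, k):
    """kth-order empirical entropy H_k of `text`, in bits per symbol.

    Characters are grouped by their preceding k-character context; the
    first k characters have no full context and are skipped, as in the
    usual definition.
    """
    n = len(text)
    if n == 0:
        return 0.0
    followers = defaultdict(Counter)  # context -> histogram of next characters
    for i in range(k, n):
        followers[text[i - k:i]][text[i]] += 1
    # Weighted sum of per-context entropies, normalised by the text length.
    return sum(sum(hist.values()) * h0(hist) for hist in followers.values()) / n


if __name__ == "__main__":
    t = "abababab"
    print(empirical_entropy(t, 0))  # two equally frequent symbols
    print(empirical_entropy(t, 1))  # each symbol is determined by its predecessor
```

For a highly repetitive text such as "abababab", H_0 is 1 bit per symbol while H_1 is 0, which illustrates why a bound expressed in n·H_k can be much smaller than one expressed in n log σ.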