High-order entropy-compressed text indexes

  • Authors:
  • Roberto Grossi;Ankur Gupta;Jeffrey Scott Vitter

  • Affiliations:
  • Università di Pisa, Pisa;Center for Geometric and Biological Computing, Durham, NC;Purdue University, West Lafayette, IN

  • Venue:
  • SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
  • Year:
  • 2003

Abstract

We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet Σ, where each symbol is encoded by lg |Σ| bits. We show that compressed suffix arrays use just nH_h + O(n lg lg n / lg_|Σ| n) bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(m lg |Σ| + polylog(n)) time. The term H_h ≤ lg |Σ| denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that H_h = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.
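To make the quantity H_h concrete, the following is a minimal Python sketch that computes the hth-order empirical entropy of a short string. It uses the standard definition: group every symbol by the h symbols that precede it, take the order-0 entropy of each group, and average over the text length n. The function names `h0` and `empirical_entropy` are illustrative and do not come from the paper. The code only measures the entropy that appears in the space bound; it does not build a compressed suffix array.

```python
import math
from collections import Counter, defaultdict


def h0(counts):
    """Order-0 empirical entropy (bits per symbol) of a symbol-count table."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def empirical_entropy(text, h):
    """hth-order empirical entropy H_h of `text`, in bits per symbol.

    Assumed definition: H_h = (1/n) * sum over contexts w of length h of
    |w_s| * H_0(w_s), where w_s collects the symbols that follow w in text.
    """
    n = len(text)
    if n == 0:
        return 0.0
    if h == 0:
        return h0(Counter(text))
    # Map each length-h context to the counts of the symbols that follow it.
    followers = defaultdict(Counter)
    for i in range(h, n):
        followers[text[i - h:i]][text[i]] += 1
    return sum(sum(c.values()) * h0(c) for c in followers.values()) / n


if __name__ == "__main__":
    sample = "mississippi"
    alphabet_bits = math.log2(len(set(sample)))
    for order in range(4):
        print(order, empirical_entropy(sample, order), "<=", alphabet_bits)
```

For example, the string "abababab" has order-0 entropy of 1 bit per symbol, but its order-1 entropy is 0 because each symbol fully determines the next. This is the kind of highly compressible text for which nH_h is far below n lg |Σ|.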