Smaller self-indexes for natural language

  • Authors: Nieves R. Brisaboa; Gonzalo Navarro; Alberto Ordóñez

  • Affiliations: Database Lab., Univ. of A Coruña, Spain; Dept. of Computer Science, Univ. of Chile, Chile; Database Lab., Univ. of A Coruña, Spain

  • Venue: SPIRE'12, Proceedings of the 19th International Conference on String Processing and Information Retrieval
  • Year: 2012

Abstract

Self-indexes for natural-language texts, where the text is regarded as a sequence of tokens (words or separators), achieve very attractive space and search time. However, they suffer from a space penalty due to their large vocabulary. In this paper we show that by replacing the Huffman encoding they implicitly use with the slightly weaker Hu-Tucker encoding, which respects the lexical order of the vocabulary, both their space and time are improved.
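
To make "slightly weaker" concrete, the following is a minimal sketch that compares the total encoded length of a Huffman code with that of an optimal order-preserving (alphabetic) code over the same token frequencies. It is not the authors' implementation and it does not run the Hu-Tucker algorithm itself: it uses a simple cubic-time interval dynamic program, which reaches the same optimal cost that Hu-Tucker computes more efficiently. The function names and the toy frequencies are assumptions made for illustration only.

```python
import heapq
from functools import lru_cache


def huffman_cost(freqs):
    """Total weighted code length (in bits) of an optimal prefix code."""
    if len(freqs) < 2:
        return 0
    heap = list(freqs)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        # Merging the two lightest subtrees adds one bit to every leaf below.
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        cost += a + b
        heapq.heappush(heap, a + b)
    return cost


def alphabetic_cost(freqs):
    """Total weighted code length of an optimal order-preserving code.

    freqs must be listed in the lexical order of the vocabulary. The tree may
    only join adjacent ranges of leaves, so codewords sort like the tokens.
    Illustrative O(n^3) interval DP, not the Hu-Tucker algorithm.
    """
    n = len(freqs)
    if n < 2:
        return 0
    prefix = [0]
    for f in freqs:
        prefix.append(prefix[-1] + f)

    @lru_cache(maxsize=None)
    def best(i, j):
        # Cheapest tree over leaves i..j (inclusive).
        if i == j:
            return 0
        weight = prefix[j + 1] - prefix[i]
        return weight + min(best(i, k) + best(k + 1, j) for k in range(i, j))

    return best(0, n - 1)


# Hypothetical frequencies of four tokens, given in lexical order.
freqs = [5, 1, 5, 1]
print(huffman_cost(freqs), alphabetic_cost(freqs))
```

On this toy input the order-preserving code costs 24 bits against 21 for Huffman. The ordering constraint can never make the code shorter than Huffman, since every order-preserving code is also a prefix code. The abstract's claim is that, for word-based self-indexes, this small loss is outweighed by the gains from having codewords that respect the vocabulary's lexical order.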