Implicit indexing of natural language text by reorganizing bytecodes

  • Authors:
  • Nieves R. Brisaboa;Antonio Fariña;Susana Ladra;Gonzalo Navarro

  • Affiliations:
  • Database Laboratory, University of A Coruña, A Coruña, Spain 15071;Database Laboratory, University of A Coruña, A Coruña, Spain 15071;Database Laboratory, University of A Coruña, A Coruña, Spain 15071;Department of Computer Science, University of Chile, Santiago, Chile 2120

  • Venue:
  • Information Retrieval
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Word-based byte-oriented compression has succeeded on large natural language text databases, by providing competitive compression ratios, fast random access, and direct sequential searching. We show that by just rearranging the target symbols of the compressed text into a tree-shaped structure, and using negligible additional space, we obtain a new implicitly indexed representation of the compressed text, where search times are drastically improved. The occurrences of a word can be listed directly, without any text scanning, and in general any inverted-index-like capability, such as efficient phrase searches, can be emulated without storing any inverted list information. We experimentally show that our proposal performs not only much more efficiently than sequential searches over compressed text, but also than explicit inverted indexes and other types of indexes, when using little extra space. Our representation is especially successful when searching for single words and short phrases.