Self-indexing Natural Language

  • Authors:
  • Nieves R. Brisaboa;Antonio Fariña;Gonzalo Navarro;Angeles S. Places;Eduardo Rodríguez

  • Affiliations:
  • Database Lab., Univ. da Coruña, Spain;Database Lab., Univ. da Coruña, Spain;Dept. of Computer Science, Univ. of Chile,;Database Lab., Univ. da Coruña, Spain;Database Lab., Univ. da Coruña, Spain

  • Venue:
  • SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
  • Year:
  • 2008

Quantified Score

Hi-index 0.01

Visualization

Abstract

Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful to reduce the size of the large indexes typically used on strings, namely suffix trees and arrays. Self-indexes represent a string in a space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. There are several challenges involved, such as dealing with a very large alphabet and detaching searchable content from non-searchable presentation aspects in the text. As a result, we show that the self-index requires space very close to that of the best word-based compressors, and that it obtains better search time than inverted indexes (using the same overall space) when searching for phrases.