A New Approach for Document Indexing Using Wavelet Trees

  • Authors:
  • Nieves R. Brisaboa;Yolanda Cillero;Antonio Farina;Susana Ladra;Oscar Pedreira

  • Affiliations:
  • University of A Coruna, Spain;University of A Coruna, Spain;University of A Coruna, Spain;University of A Coruna, Spain;University of A Coruna, Spain

  • Venue:
  • DEXA '07 Proceedings of the 18th International Conference on Database and Expert Systems Applications
  • Year:
  • 2007

Abstract

The development of applications that manage large text collections requires indexing methods that allow efficient retrieval over text. Several indexes have been proposed that try to reach a good trade-off between the space needed to store both the text and the index, and search efficiency. Self-indexes have become increasingly popular in recent years. Not only do they index the text, but they also keep enough information to recover any portion of it without the need to store it explicitly. Therefore, they actually replace the text. In this paper, we focus on a self-index known as the wavelet tree. Originally organized as a binary tree, it was designed to index the characters of a text. We present three variants of this method that aim at reducing its size while keeping a good trade-off between space and performance, as well as making it well suited for indexing natural language texts. The first approach we describe combines Huffman compression and wavelet trees. The other two new variants index words instead of characters and use two different word-based compressors.
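
For readers unfamiliar with the structure, the following is a minimal sketch of a classic character-based binary wavelet tree, the baseline the abstract starts from. It is not the authors' implementation and does not cover the Huffman-shaped or word-based variants. The class name `WaveletTree` and its methods `access` and `rank` are hypothetical names chosen for illustration, and the bitmaps are plain Python lists scanned linearly, whereas a practical index would use bitmaps with fast rank support.

```python
class WaveletTree:
    """Illustrative pointer-based binary wavelet tree over a sequence of symbols."""

    def __init__(self, text, alphabet=None):
        # Sorted list of the distinct symbols handled by this node.
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.size = len(text)
        self.bits = []
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return  # leaf: a single symbol (or an empty text)
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        # Bit 0 sends a symbol to the left child, bit 1 to the right child.
        self.bits = [0 if c in left_syms else 1 for c in text]
        self.left = WaveletTree([c for c in text if c in left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in text if c not in left_syms],
                                 self.alphabet[mid:])

    def access(self, i):
        """Recover the symbol at position i without storing the text itself."""
        node = self
        while node.left is not None:
            bit = node.bits[i]
            i = node.bits[:i].count(bit)  # position of this symbol in the child
            node = node.right if bit else node.left
        return node.alphabet[0]

    def rank(self, c, i):
        """Count the occurrences of symbol c in the first i positions."""
        node = self
        i = min(i, node.size)
        while node.left is not None:
            mid = len(node.alphabet) // 2
            bit = 0 if c in node.alphabet[:mid] else 1
            i = node.bits[:i].count(bit)
            node = node.right if bit else node.left
        return i if node.alphabet and node.alphabet[0] == c else 0


text = "abracadabra"
wt = WaveletTree(text)
recovered = "".join(wt.access(i) for i in range(len(text)))
```

The `access` method illustrates the self-index property mentioned in the abstract: the original text is rebuilt from the bitmaps alone, so it does not have to be stored separately. Counting occurrences with `rank` is the basic operation on which searching is built.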