A New Approach for Document Indexing Using Wavelet Trees

Authors:
Nieves R. Brisaboa;Yolanda Cillero;Antonio Farina;Susana Ladra;Oscar Pedreira
Affiliations:
University of A Coruna, Spain;University of A Coruna, Spain;University of A Coruna, Spain;University of A Coruna, Spain;University of A Coruna, Spain
Venue:
DEXA '07 Proceedings of the 18th International Conference on Database and Expert Systems Applications
Year:
2007

Citing 0
Cited 1

A New Point Access Method Based on Wavelet Trees

ER '09 Proceedings of the ER 2009 Workshops (CoMoL, ETheCoM, FP-UML, MOST-ONISW, QoIS, RIGiM, SeCoGIS) on Advances in Conceptual Modeling - Challenging Perspectives

Abstract

The development of applications that manage large text collections requires indexing methods that allow efficient retrieval over text. Several indexes have been proposed that try to reach a good trade-off between the space needed to store both the text and the index, and search efficiency. Self-indexes have become increasingly popular in recent years. Not only do they index the text, but they also keep enough information to recover any portion of it without storing it explicitly. Therefore, they actually replace the text. In this paper, we focus on a self-index known as the wavelet tree. Originally organized as a binary tree, it was designed to index the characters of a text. We present three variants of this method that aim at reducing its size, while keeping a good trade-off between space and performance, as well as making it well suited for indexing natural language texts. The first approach we describe combines Huffman compression and wavelet trees. The other two new variants index words instead of characters and use two different word-based compressors.
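
For readers unfamiliar with the structure, the following is a minimal sketch of a plain character-based binary wavelet tree. It is not one of the paper's variants. The class name `WaveletTree`, the balanced split of the sorted alphabet, and the linear-scan bit rank are all assumptions made for illustration; practical implementations use compact bitmaps with sublinear rank structures, and the Huffman-shaped variant would split the alphabet by code bits rather than by halves.

```python
class WaveletTree:
    """Illustrative pointer-based binary wavelet tree over characters."""

    def __init__(self, text, alphabet=None):
        # Sorted alphabet of this node; the root derives it from the text.
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.bits = []
        self.left = None
        self.right = None
        if len(self.alphabet) <= 1:
            return  # leaf: a single symbol, no bitmap needed
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        # Bit 0 sends a character to the left child, bit 1 to the right.
        self.bits = [0 if ch in left_syms else 1 for ch in text]
        self.left = WaveletTree(
            [ch for ch in text if ch in left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [ch for ch in text if ch not in left_syms], self.alphabet[mid:])

    def _rank_bit(self, bit, i):
        # Naive count of `bit` in bits[:i] (assumption: kept simple on purpose).
        return sum(1 for b in self.bits[:i] if b == bit)

    def rank(self, ch, i):
        """Number of occurrences of ch in the first i characters."""
        if ch not in self.alphabet:
            return 0
        node = self
        while node.left is not None:
            mid = len(node.alphabet) // 2
            bit = 0 if ch in node.alphabet[:mid] else 1
            i = node._rank_bit(bit, i)
            node = node.left if bit == 0 else node.right
        return i

    def access(self, i):
        """Recover the character at position i from the bitmaps alone."""
        node = self
        while node.left is not None:
            bit = node.bits[i]
            i = node._rank_bit(bit, i)
            node = node.left if bit == 0 else node.right
        return node.alphabet[0]


wt = WaveletTree("abracadabra")
recovered = "".join(wt.access(i) for i in range(11))
```

Here `access` rebuilds the text from the bitmaps and the alphabet only, which is the self-index property the abstract describes, while `rank` counts occurrences of a character in a prefix.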