CLEF'04 Proceedings of the 5th conference on Cross-Language Evaluation Forum: multilingual Information Access for Text, Speech and Images
Our group in the Department of Informatics at the University of Oviedo has participated, for the first time, in two CLEF tasks: monolingual (Russian) and bilingual (Spanish-to-English) information retrieval. Our main goal was to test the application to IR of a modified version of the n-gram vector space model (codenamed blindLight). This approach has already been applied successfully to other NLP tasks, such as language identification and text summarization, and the results achieved at CLEF 2004, although not exceptional, are encouraging. There are two major differences between the blindLight approach and classical techniques: (1) relative frequencies are no longer used as vector weights but are replaced by n-gram significances, and (2) cosine distance is abandoned in favor of a new, less computationally expensive metric inspired by sequence-alignment techniques. To perform cross-language IR we developed a naive n-gram pseudo-translator similar to those described by McNamee and Mayfield or by Pirkola et al.
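The abstract names the two departures from the classical model but gives no formulas, so the following is only a minimal sketch of the idea, not the actual blindLight implementation: character n-grams are weighted by a log-likelihood-style significance score (how much more frequent an n-gram is in the document than a background corpus would predict) rather than by relative frequency, and documents are compared with a cheap asymmetric overlap score instead of cosine distance. All function names and the exact weighting and scoring formulas here are hypothetical.

```python
from collections import Counter
from math import log

def ngrams(text, n=4):
    """Count overlapping character n-grams in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def significance_weights(doc, corpus_counts, corpus_total, n=4):
    """Weight each n-gram of `doc` by a significance score instead of its
    relative frequency: positive when the n-gram occurs more often than a
    background corpus predicts, zero otherwise. This log-likelihood-style
    score is a hypothetical stand-in for blindLight's actual weighting."""
    doc_counts = ngrams(doc, n)
    doc_total = sum(doc_counts.values())
    weights = {}
    for gram, observed in doc_counts.items():
        # Expected count of `gram` under the background corpus distribution.
        expected = doc_total * corpus_counts.get(gram, 1) / corpus_total
        weights[gram] = observed * log(observed / expected) if observed > expected else 0.0
    return weights

def asymmetric_overlap(q_weights, d_weights):
    """Asymmetric similarity: the fraction of the query's total significance
    carried by n-grams that also occur in the document. Cheaper than cosine
    distance over a shared vocabulary; a hypothetical illustration of the
    sequence-alignment-inspired metric, not its published definition."""
    shared = sum(w for gram, w in q_weights.items() if gram in d_weights)
    total = sum(q_weights.values())
    return shared / total if total else 0.0
```

A usage pattern would be to build `corpus_counts` once from a reference collection, weight each document and each query with `significance_weights`, and rank documents by `asymmetric_overlap(query_weights, doc_weights)`; because the score is normalized by the query side only, it stays comparable across documents of very different lengths.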