Vocabulary reduction and text enrichment at WebCLEF

Authors:
Franco Rojas;Héctor Jiménez-Salazar;David Pinto
Affiliations:
Department of Information Systems and Computation, UPV, Spain and Faculty of Computer Science, BUAP, Mexico;Department of Information Technologies, UAM, México;Faculty of Computer Science, BUAP, Mexico and Department of Information Systems and Computation, UPV, Spain
Venue:
CLEF'06 Proceedings of the 7th international conference on Cross-Language Evaluation Forum: evaluation of multilingual and multi-modal information retrieval
Year:
2006

Abstract

Cross-lingual Information Retrieval (IR) is currently one of the greatest challenges in the field. One of the most important issues in IR is the reduction of the corpus vocabulary: in practice, some IR methods, such as the well-known vector space model, require the term space to be reduced. In this work, we consider a vocabulary reduction process based on the selection of mid-frequency terms. This approach enhances precision; to obtain better recall, we apply an enrichment process that adds co-occurring terms. With this approach, we obtained an improvement of 40% on the BiEnEs topics of the WebCLEF 2005 task. The results obtained in the mixed monolingual task of WebCLEF 2006 show that text enrichment must be done before vocabulary reduction in order to get the best performance.
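
The pipeline described in the abstract (enrich documents with co-occurring terms, then keep only mid-frequency terms) can be illustrated with a minimal Python sketch. The abstract does not say how the mid-frequency band is chosen or how co-occurrence is measured, so everything below is an assumption made for illustration: the function names, the fixed `low`/`high` frequency bounds, and the definition of co-occurrence as "appearing together in at least one document" are hypothetical and are not the authors' implementation.

```python
from collections import Counter, defaultdict


def mid_frequency_terms(docs, low=2, high=3):
    """Return terms whose corpus frequency lies in [low, high].

    The bounds are hypothetical; the abstract does not give them.
    """
    freq = Counter(term for doc in docs for term in doc)
    return {t for t, f in freq.items() if low <= f <= high}


def cooccurrence_map(docs):
    """Map each term to the terms that share at least one document with it."""
    co = defaultdict(set)
    for doc in docs:
        terms = set(doc)
        for t in terms:
            co[t] |= terms - {t}
    return co


def enrich(doc, co):
    """Append co-occurring terms that the document does not already contain."""
    seen = set(doc)
    extra = sorted({c for t in doc for c in co[t]} - seen)
    return list(doc) + extra


def enrich_then_reduce(docs, low=2, high=3):
    """Enrich every document first, then filter by mid-frequency vocabulary."""
    co = cooccurrence_map(docs)
    enriched = [enrich(d, co) for d in docs]
    vocab = mid_frequency_terms(enriched, low, high)
    return [[t for t in d if t in vocab] for d in enriched]
```

Calling `enrich_then_reduce(docs)` on a list of tokenized documents applies the two steps in the order the abstract reports as performing best. Reversing the order would filter the vocabulary before any co-occurring terms are added, so terms introduced by enrichment would never be counted when the frequency band is computed.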