Using query-relevant documents pairs for cross-lingual information retrieval

Authors:
David Pinto;Alfons Juan;Paolo Rosso
Affiliations:
Department of Information Systems and Computation, Polytechnic University of Valencia, Spain and Faculty of Computer Science, B. Autonomous University of Puebla, Mexico;Department of Information Systems and Computation, Polytechnic University of Valencia, Spain;Department of Information Systems and Computation, Polytechnic University of Valencia, Spain
Venue:
TSD'07 Proceedings of the 10th international conference on Text, speech and dialogue
Year:
2007

Abstract

The World Wide Web is a natural setting for cross-lingual information retrieval. The European Union is a typical multilingual scenario, in which users have to deal with information published in at least 20 languages. Given queries in a source language and a target corpus in another language, the typical approach is to translate either the query or the target dataset into the other language. Other approaches use parallel corpora to obtain a statistical dictionary of words across the different languages. In this work, we propose using a training corpus made up of a set of Query-Relevant Document Pairs (QRDP) in a probabilistic cross-lingual information retrieval approach based on IBM alignment model 1 for statistical machine translation. Our approach has two main advantages over those that use direct translation or parallel corpora. First, we do not obtain a translation of the query but a set of associated words that share their meaning in some way; the resulting dictionary is therefore, in a broad sense, more semantic than a translation dictionary. Second, since the queries are supervised, we work in a more restricted domain than when using a general parallel corpus (it is well known that results in a restricted domain are better than those obtained in a general context). To evaluate our approach, we compared its results with those obtained by directly translating the queries with a query translation system, and observed promising results.
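
To make the idea of training IBM alignment model 1 on query-relevant document pairs more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that each training pair is a tokenised query and a tokenised relevant document, that the model estimates t(q | d), the probability of a query word q given a document word d, and that documents are ranked by the resulting query likelihood. The function names `train_ibm1` and `score`, the toy Spanish/English data, and the omission of the NULL word are illustrative assumptions.

```python
def train_ibm1(pairs, iterations=10):
    """EM estimation of t(q | d) from (query_tokens, doc_tokens) pairs.

    Simplified IBM Model 1: no NULL word, uniform initialisation.
    """
    query_vocab = {q for query, _ in pairs for q in query}
    uniform = 1.0 / len(query_vocab)
    t = {}  # (query_word, doc_word) -> probability
    for _ in range(iterations):
        count = {}  # expected co-occurrence counts for (q, d)
        total = {}  # expected counts for each document word d
        for query, doc in pairs:
            for q in query:
                # E-step: distribute q over the words of its paired document.
                norm = sum(t.get((q, d), uniform) for d in doc)
                for d in doc:
                    frac = t.get((q, d), uniform) / norm
                    count[(q, d)] = count.get((q, d), 0.0) + frac
                    total[d] = total.get(d, 0.0) + frac
        # M-step: renormalise so that sum over q of t(q | d) equals 1.
        t = {(q, d): c / total[d] for (q, d), c in count.items()}
    return t


def score(query, doc, t):
    """Model 1 likelihood of the query given the document, up to a constant."""
    p = 1.0
    for q in query:
        p *= sum(t.get((q, d), 0.0) for d in doc) / len(doc)
    return p


# Hypothetical toy corpus: Spanish queries paired with relevant English documents.
pairs = [
    (["casa", "verde"], ["green", "house"]),
    (["casa"], ["house"]),
    (["verde"], ["green"]),
]
t = train_ibm1(pairs)

# Rank two candidate documents for the query "casa".
candidates = [["house"], ["green"]]
ranking = sorted(candidates, key=lambda doc: score(["casa"], doc, t), reverse=True)
```

Because the pairs are query-document rather than sentence-translation pairs, the learned table associates a query word with any document word that tends to co-occur with it in relevant documents, which is the sense in which the abstract calls the dictionary more semantic than a translation dictionary. In this sketch an unseen query word drives the score to zero; a practical system would need some form of smoothing.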