Customizing parallel corpora at the document level

Authors:
Monica Rogati;Yiming Yang
Affiliations:
Carnegie Mellon University, Pittsburgh, PA;Carnegie Mellon University, Pittsburgh, PA
Venue:
ACLdemo '04 Proceedings of the ACL 2004 on Interactive poster and demonstration sessions
Year:
2004

Abstract

Recent research in cross-lingual information retrieval (CLIR) has established the need to properly match the parallel corpus used for query translation to the target corpus. We propose a document-level approach to this problem: building a custom-made parallel corpus by automatically assembling it from documents taken from other parallel corpora. Although the general idea applies to any application that uses parallel corpora, we present results for CLIR in the medical domain. To extract the best-matched documents from several parallel corpora, we propose ranking individual documents by a length-normalized Okapi-based similarity score between each document and the target corpus. This ranking allows us to discard 50-90% of the training data while avoiding the performance drop caused by a good but mismatched resource, and even improves CLIR effectiveness by 4-7% compared to using all available training data.
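
The document-ranking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so the code below is an assumption rather than the authors' method: it uses standard Okapi BM25 term weighting, with BM25's built-in document-length normalization (the `b` parameter), and treats the vocabulary of the target corpus as one long query. The function names `rank_by_target_similarity` and `select_top`, the parameter defaults, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def rank_by_target_similarity(candidates, target_docs, k1=1.2, b=0.75):
    """Return candidate indices ordered from best- to worst-matched.

    candidates:  list of tokenized documents (one side of each parallel pair)
    target_docs: list of tokenized documents from the target corpus
    Assumes both lists are non-empty and candidates contain tokens.
    """
    # Assumption: the target corpus is reduced to its vocabulary and used
    # as a single query; the paper may weight target terms differently.
    target_vocab = {tok for doc in target_docs for tok in doc}

    n = len(candidates)
    avg_len = sum(len(doc) for doc in candidates) / n
    # Document frequency of each term across the candidate pool.
    df = Counter(tok for doc in candidates for tok in set(doc))

    def score(doc):
        tf = Counter(doc)
        # BM25 length normalization: longer documents are penalized.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        total = 0.0
        for term, freq in tf.items():
            if term not in target_vocab:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            total += idf * freq * (k1 + 1) / (freq + norm)
        return total

    scores = [score(doc) for doc in candidates]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


def select_top(candidates, target_docs, keep_fraction=0.3):
    """Keep only the best-matched fraction of the candidate documents."""
    ranking = rank_by_target_similarity(candidates, target_docs)
    keep = max(1, round(keep_fraction * len(candidates)))
    return [candidates[i] for i in ranking[:keep]]


# Toy data (hypothetical): two medical documents and one parliamentary one.
candidates = [
    ["patient", "dose", "drug"],
    ["parliament", "vote", "law"],
    ["drug", "trial", "patient", "vote"],
]
target = [["patient", "drug", "therapy"]]
ranking = rank_by_target_similarity(candidates, target)
```

Discarding 50-90% of the training data, as reported in the abstract, would correspond to a `keep_fraction` between 0.5 and 0.1 in this sketch. In a real setting the kept indices would be used to select both sides of each parallel document pair.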