Customizing parallel corpora at the document level

  • Authors:
  • Monica Rogati; Yiming Yang

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA (both authors)

  • Venue:
  • ACLdemo '04: Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions
  • Year:
  • 2004

Abstract

Recent research in cross-lingual information retrieval (CLIR) has established the need to properly match the parallel corpus used for query translation to the target corpus. We propose a document-level approach to this problem: building a custom-made parallel corpus by automatically assembling it from documents drawn from other parallel corpora. Although the general idea can be applied to any application that uses parallel corpora, we present results for CLIR in the medical domain. To extract the best-matched documents from several parallel corpora, we rank individual documents by a length-normalized, Okapi-based similarity score between each document and the target corpus. This ranking allows us to discard 50-90% of the training data while avoiding the performance drop caused by a good but mismatched resource, and it even improves CLIR effectiveness by 4-7% compared to using all available training data.
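The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it scores each pooled parallel-corpus document with Okapi BM25 against the target-corpus vocabulary, normalizes by document length, and keeps the top-ranked fraction. The function names, the `keep_fraction` parameter, and the specific BM25 constants (`k1`, `b`) are assumptions for illustration only; the paper does not specify them.

```python
import math
from collections import Counter

def bm25_score(doc_tokens, query_terms, df, n_docs, avgdl, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document against a set of query terms.

    df: document frequency of each term in the pool; avgdl: average doc length.
    """
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        num = tf[term] * (k1 + 1.0)
        den = tf[term] + k1 * (1.0 - b + b * dl / avgdl)
        score += idf * num / den
    return score

def rank_pool(pool, target_terms, keep_fraction=0.3):
    """Rank pooled parallel-corpus documents by length-normalized BM25
    similarity to the target-corpus vocabulary; return indices of the
    top `keep_fraction` of documents (the rest are discarded)."""
    n = len(pool)
    df = Counter()
    for doc in pool:
        df.update(set(doc))                      # document frequencies
    avgdl = sum(len(d) for d in pool) / n        # average document length
    scored = sorted(
        ((bm25_score(d, target_terms, df, n, avgdl) / max(len(d), 1), i)
         for i, d in enumerate(pool)),
        reverse=True)
    keep = max(1, int(keep_fraction * n))
    return [i for _, i in scored[:keep]]
```

For example, with a pool of three tokenized documents and a medical target vocabulary, the non-medical document is the one discarded:

```python
pool = [["heart", "disease", "treatment"],
        ["stock", "market", "news"],
        ["medical", "heart", "surgery"]]
kept = rank_pool(pool, {"heart", "medical", "disease"}, keep_fraction=0.67)
# kept contains the indices of the two medical documents
```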