Exploiting variant corpora for machine translation

  • Authors:
  • Michael Paul; Eiichiro Sumita

  • Affiliations:
  • National Institute of Information and Communications Technology and ATR Spoken Language Communication Research Labs, Keihanna Science City, Kyoto (both authors)

  • Venue:
  • NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
  • Year:
  • 2006

Abstract

This paper proposes the use of variant corpora, i.e., parallel text corpora that are equal in meaning but express the content in different ways, to improve corpus-based machine translation. Training on multiple corpora that share the same content but differ in source yields variant models, each focusing on the specific linguistic phenomena covered by its corpus. The proposed method applies each variant model separately, producing multiple translation hypotheses that are selectively combined according to statistical models. This approach outperforms the conventional approach of merging all variants into a single training corpus, because it reduces translation ambiguities and exploits the strengths of each variant model.
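
The selection step described in the abstract can be illustrated with a minimal sketch: decode the input with each variant model separately, then keep the hypothesis with the highest statistical score. The function and class names below (`translate_with_variant`, `Hypothesis`, the placeholder scoring) are hypothetical stand-ins, not the paper's actual decoder or selection criteria, which the abstract does not detail.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str        # candidate translation
    model_id: str    # which variant model produced it
    score: float     # statistical model score (e.g., combined TM/LM score)


def translate_with_variant(model_id: str, source: str) -> Hypothesis:
    """Stand-in for decoding `source` with one variant model.

    A real system would run an SMT decoder trained on the corresponding
    variant corpus and return its hypothesis together with a model score.
    """
    text = f"<translation of '{source}' by {model_id}>"
    score = -len(source) / (1 + len(model_id))  # placeholder score only
    return Hypothesis(text=text, model_id=model_id, score=score)


def select_best(source: str, variant_model_ids: list[str]) -> Hypothesis:
    """Apply each variant model separately; keep the best-scoring hypothesis."""
    hypotheses = [translate_with_variant(m, source) for m in variant_model_ids]
    return max(hypotheses, key=lambda h: h.score)


if __name__ == "__main__":
    best = select_best("kono densha wa kyoto yuki desu ka",
                       ["variant_A", "variant_B"])
    print(best.model_id, best.text)
```

The key design point, per the abstract, is that the variant models are kept separate rather than merged into one training corpus, so selection can pick whichever model happens to cover the linguistic phenomena of a given input best.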