Intersecting multilingual data for faster and better statistical translations

  • Authors:
  • Yu Chen; Martin Kay; Andreas Eisele

  • Affiliations:
  • Yu Chen: Universität des Saarlandes, Saarbrücken, Germany and Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Saarbrücken, Germany
  • Martin Kay: Universität des Saarlandes, Saarbrücken, Germany and Stanford University, CA
  • Andreas Eisele: Universität des Saarlandes, Saarbrücken, Germany and Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Saarbrücken, Germany

  • Venue:
  • NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
  • Year:
  • 2009

Abstract

In current phrase-based SMT systems, more training data is generally better than less. However, a larger data set eventually yields a larger model, which enlarges the search space for the translation problem and consequently requires more time and more resources to translate. We argue that redundant information in an SMT system may not only slow down the computation but also degrade the quality of the output. This paper proposes an approach to reduce the model size by filtering out the less probable entries based on compatible data in an intermediate language, a novel use of triangulation, without sacrificing translation quality. Comprehensive experiments were conducted on standard data sets. We achieved significant quality improvements (up to 2.3 Bleu points) while translating with reduced models. In addition, we demonstrate a straightforward combination method for more progressive filtering. The model size can be reduced by up to 94% while translation quality is preserved.
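To illustrate the core idea, the sketch below shows one way triangulation-based filtering of a phrase table could look: a source-target entry is kept only if some pivot-language phrase connects the two sides with sufficient probability mass in the source-pivot and pivot-target tables. This is a minimal, hypothetical sketch, not the paper's implementation; the helper names (`load_phrase_table`, `triangulation_filter`), the Moses-style `|||` field layout, the use of the first score column, and the `threshold` value are all assumptions, and the paper's actual filtering criterion may differ.

```python
from collections import defaultdict


def load_phrase_table(path):
    """Load a Moses-style phrase table: 'src ||| tgt ||| scores ...' per line.

    Hypothetical helper: only the first score column, read as p(tgt|src),
    is kept; any remaining fields are ignored.
    """
    table = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(" ||| ")
            src, tgt, scores = fields[0], fields[1], fields[2]
            table[src][tgt] = float(scores.split()[0])
    return table


def triangulation_filter(src_tgt, src_piv, piv_tgt, threshold=1e-4):
    """Keep a source-target entry only if some pivot phrase links both sides.

    For each (src, tgt) pair, the best pivot support is
    max over piv of p(piv|src) * p(tgt|piv); entries whose support falls
    below the (assumed) threshold are filtered out of the model.
    """
    kept = defaultdict(dict)
    for src, candidates in src_tgt.items():
        pivots = src_piv.get(src, {})
        for tgt, score in candidates.items():
            support = max(
                (p_piv * piv_tgt.get(piv, {}).get(tgt, 0.0)
                 for piv, p_piv in pivots.items()),
                default=0.0,
            )
            if support >= threshold:
                kept[src][tgt] = score
    return kept


# Example usage (paths are placeholders):
# src_tgt = load_phrase_table("phrase-table.de-en")
# src_piv = load_phrase_table("phrase-table.de-es")
# piv_tgt = load_phrase_table("phrase-table.es-en")
# reduced = triangulation_filter(src_tgt, src_piv, piv_tgt)
```

The design choice in this sketch, scoring each entry by its best single pivot path, is just one plausible compatibility test; summing over all pivot phrases or thresholding on rank instead of probability would be equally reasonable variants.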