Intersecting multilingual data for faster and better statistical translations

  • Authors:
  • Yu Chen; Martin Kay; Andreas Eisele

  • Affiliations:
  • Yu Chen: Universität des Saarlandes, Saarbrücken, Germany and Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Saarbrücken, Germany
  • Martin Kay: Universität des Saarlandes, Saarbrücken, Germany and Stanford University, CA
  • Andreas Eisele: Universität des Saarlandes, Saarbrücken, Germany and Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, Saarbrücken, Germany

  • Venue:
  • NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
  • Year:
  • 2009

Abstract

In current phrase-based SMT systems, more training data is generally better than less. However, a larger data set eventually yields a larger model, which enlarges the search space for the translation problem and consequently requires more time and more resources to translate. We argue that redundant information in an SMT system may not only slow down the computation but also degrade the quality of the output. This paper proposes an approach to reduce the model size by filtering out the less probable entries based on compatible data in an intermediate language, a novel use of triangulation, without sacrificing translation quality. Comprehensive experiments were conducted on standard data sets. We achieved significant quality improvements (up to 2.3 Bleu points) while translating with reduced models. In addition, we demonstrate a straightforward combination method for more progressive filtering. The model size can be reduced by up to 94% while translation quality is preserved.
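To illustrate the core idea, the sketch below shows one way triangulation-based filtering of a phrase table could look: a source-target entry is kept only if some pivot-language phrase connects the two sides with sufficient probability mass in the source-pivot and pivot-target tables. This is a minimal, hypothetical sketch, not the paper's implementation; the helper names (`load_phrase_table`, `triangulation_filter`), the Moses-style `|||` field layout, the use of the first score column, and the `threshold` value are all assumptions, and the paper's actual filtering criterion may differ.

```python
from collections import defaultdict


def load_phrase_table(path):
    """Load a Moses-style phrase table: 'src ||| tgt ||| scores ...' per line.

    Hypothetical helper: only the first score column, read as p(tgt|src),
    is kept; any remaining fields are ignored.
    """
    table = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split(" ||| ")
            src, tgt, scores = fields[0], fields[1], fields[2]
            table[src][tgt] = float(scores.split()[0])
    return table


def triangulation_filter(src_tgt, src_piv, piv_tgt, threshold=1e-4):
    """Keep a source-target entry only if some pivot phrase links both sides.

    For each (src, tgt) pair, the best pivot support is
    max over piv of p(piv|src) * p(tgt|piv); entries whose support falls
    below the (assumed) threshold are filtered out of the model.
    """
    kept = defaultdict(dict)
    for src, candidates in src_tgt.items():
        pivots = src_piv.get(src, {})
        for tgt, score in candidates.items():
            support = max(
                (p_piv * piv_tgt.get(piv, {}).get(tgt, 0.0)
                 for piv, p_piv in pivots.items()),
                default=0.0,
            )
            if support >= threshold:
                kept[src][tgt] = score
    return kept


# Example usage (paths are placeholders):
# src_tgt = load_phrase_table("phrase-table.de-en")
# src_piv = load_phrase_table("phrase-table.de-es")
# piv_tgt = load_phrase_table("phrase-table.es-en")
# reduced = triangulation_filter(src_tgt, src_piv, piv_tgt)
```

The design choice in this sketch, scoring each entry by its best single pivot path, is just one plausible compatibility test; summing over all pivot phrases or thresholding on rank instead of probability would be equally reasonable variants.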