Improving English-Spanish statistical machine translation: experiments in domain adaptation, sentence paraphrasing, tokenization, and recasing

Authors: Preslav Nakov
Affiliations: University of California at Berkeley, Berkeley, CA
Venue: StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Year: 2008

Abstract

We describe the experiments of the UC Berkeley team on improving English-Spanish machine translation of news text, as part of the WMT'08 Shared Translation Task. We experiment with domain adaptation, combining a small in-domain news bi-text with a large out-of-domain bi-text from the Europarl corpus, and building two separate phrase translation models and two separate language models. We further add a third phrase translation model trained on a version of the news bi-text augmented with monolingual sentence-level syntactic paraphrases on the source-language side, and we combine all models in a log-linear model using minimum error rate training. Finally, we experiment with different tokenization and recasing rules, achieving a BLEU score of 35.09% on the WMT'07 news test data when translating from English to Spanish, a sizable improvement over the highest BLEU score achieved on that dataset at WMT'07: 33.10% (also obtained by our system). On the WMT'08 English-to-Spanish news translation task, we achieve 21.92%, which ranks our team second by BLEU score.
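
The log-linear combination mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system. It assumes each candidate translation carries one log-probability per model (three phrase translation models and two language models), whereas a real phrase-based decoder typically uses several features per phrase table. The feature names, weights, and scores below are hypothetical, and the weights are set by hand here rather than tuned with minimum error rate training.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: sum_i lambda_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Return the candidate with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))


# Hypothetical weights; in practice these would be tuned on a development set
# (for example with minimum error rate training) rather than set by hand.
WEIGHTS = {
    "tm_news": 0.30,        # phrase model from the in-domain news bi-text
    "tm_europarl": 0.20,    # phrase model from the out-of-domain Europarl bi-text
    "tm_paraphrase": 0.10,  # phrase model from the paraphrase-augmented news bi-text
    "lm_news": 0.25,        # in-domain language model
    "lm_europarl": 0.15,    # out-of-domain language model
}

# Hypothetical candidates with made-up log-probabilities for each model.
CANDIDATES = [
    {
        "text": "candidate A",
        "features": {
            "tm_news": -1.0, "tm_europarl": -2.0, "tm_paraphrase": -1.0,
            "lm_news": -2.0, "lm_europarl": -4.0,
        },
    },
    {
        "text": "candidate B",
        "features": {
            "tm_news": -2.0, "tm_europarl": -1.0, "tm_paraphrase": -2.0,
            "lm_news": -3.0, "lm_europarl": -2.0,
        },
    },
]

if __name__ == "__main__":
    for cand in CANDIDATES:
        print(cand["text"], round(loglinear_score(cand["features"], WEIGHTS), 4))
    print("best:", best_candidate(CANDIDATES, WEIGHTS)["text"])
```

With these made-up numbers, candidate A wins because the in-domain models carry the larger weights. Changing the weights changes the ranking, which is why the weights are normally tuned against an evaluation metric on held-out data.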