StatMT '09 Proceedings of the Fourth Workshop on Statistical Machine Translation
In Minimum Error Rate Training (MERT), BLEU is often used as the error function, even though it has been shown to correlate less well with human judgment than other metrics such as METEOR and TER. In this paper, we present empirical results showing that parameters tuned on BLEU can yield sub-optimal BLEU scores under certain data conditions. Tuning on a different metric, e.g. METEOR, can improve such scores significantly: by 0.0082 BLEU, a 3.38% relative improvement, on the WMT08 English-French data. We analyze how the number of references and the choice of metric influence the result of MERT, and we experiment on different data sets. We show the problems of tuning on a metric that is not designed for the single-reference scenario and point out some possible solutions.
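The abstract gives the gain both as an absolute figure (0.0082) and as a relative one (3.38%), so the two can be related with a short calculation. The sketch below is only illustrative: it assumes BLEU is on a 0-1 scale and that the relative figure is measured against the score of the BLEU-tuned system. The function name `implied_baseline` is hypothetical, and the baseline it computes is inferred from the two reported numbers, not a value stated in the abstract.

```python
def implied_baseline(absolute_gain: float, relative_gain: float) -> float:
    """Return the baseline score implied by an absolute and a relative gain.

    Assumes relative_gain = absolute_gain / baseline (an assumption; the
    abstract does not say which score the percentage is relative to).
    """
    if relative_gain <= 0:
        raise ValueError("relative_gain must be positive")
    return absolute_gain / relative_gain


# The two figures reported in the abstract.
absolute_gain = 0.0082  # BLEU, assumed to be on a 0-1 scale
relative_gain = 0.0338  # 3.38% expressed as a fraction

# Derived values: inferred from the figures above, not reported results.
baseline = implied_baseline(absolute_gain, relative_gain)
improved = baseline + absolute_gain

print(f"implied BLEU when tuning on BLEU:     {baseline:.4f}")
print(f"implied BLEU after the reported gain: {improved:.4f}")
```

Under these assumptions the two reported figures are consistent with a baseline of roughly 0.24 BLEU, which shows why an absolute gain that looks small still amounts to a relative improvement of more than 3%.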