Previous studies have shown that automatic evaluation metrics are more reliable when candidate translations are compared against multiple human references. However, multiple human references are not always available; it is more common to have only a single human reference (extracted from parallel texts) or no reference at all. Our earlier work suggested that one way to address this problem is to train a metric to evaluate a sentence by comparing it against pseudo references: imperfect "references" produced by off-the-shelf MT systems. In this paper, we further examine the approach, both in terms of the training methodology and in terms of the roles of the human and pseudo references. Our expanded experiments show that the approach generalizes well across multiple years and different source languages.
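The core idea can be illustrated with a minimal sketch. The abstract does not specify the exact feature set or learner, so the following is an assumption-laden illustration: each candidate sentence is turned into a feature vector of clipped n-gram precisions against each pseudo reference (the output of an off-the-shelf MT system), and those features would then feed a trained regression model that predicts a human quality judgment. The function names and the choice of n-gram precision features are hypothetical, not taken from the paper.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate sentence against one
    (possibly imperfect) reference sentence, both whitespace-tokenized."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Count candidate n-grams, clipped by how often they occur in the reference.
    clipped = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0

def pseudo_reference_features(candidate, pseudo_refs, max_n=2):
    """One feature per (pseudo reference, n-gram order) pair.

    pseudo_refs are translations of the same source sentence produced by
    off-the-shelf MT systems; a learned metric would map this feature
    vector to a predicted quality score (the learner itself is omitted here).
    """
    return [ngram_precision(candidate, ref, n)
            for ref in pseudo_refs
            for n in range(1, max_n + 1)]
```

Because the pseudo references are themselves imperfect, the point of training (rather than using BLEU-style matching directly) is that the learner can weight agreement with each MT system according to how informative that agreement actually is about human judgments.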