Previous studies have shown that automatic evaluation metrics are more reliable when candidate translations are compared against multiple human references. However, multiple human references are not always available; it is more common to have only a single human reference (extracted from parallel texts) or no reference at all. Our earlier work suggested that one way to address this problem is to train a metric to evaluate a sentence by comparing it against pseudo references: imperfect "references" produced by off-the-shelf MT systems. In this paper, we further examine the approach, both in terms of the training methodology and in terms of the roles of the human and pseudo references. Our expanded experiments show that the approach generalizes well across multiple years and different source languages.
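The core idea can be illustrated with a minimal sketch. The abstract does not specify the exact feature set or learner, so the following is an assumption-laden illustration: each candidate sentence is turned into a feature vector of clipped n-gram precisions against each pseudo reference (the output of an off-the-shelf MT system), and those features would then feed a trained regression model that predicts a human quality judgment. The function names and the choice of n-gram precision features are hypothetical, not taken from the paper.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate sentence against one
    (possibly imperfect) reference sentence, both whitespace-tokenized."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Count candidate n-grams, clipped by how often they occur in the reference.
    clipped = sum(min(count, ref_ngrams[g]) for g, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return clipped / total if total else 0.0

def pseudo_reference_features(candidate, pseudo_refs, max_n=2):
    """One feature per (pseudo reference, n-gram order) pair.

    pseudo_refs are translations of the same source sentence produced by
    off-the-shelf MT systems; a learned metric would map this feature
    vector to a predicted quality score (the learner itself is omitted here).
    """
    return [ngram_precision(candidate, ref, n)
            for ref in pseudo_refs
            for n in range(1, max_n + 1)]
```

Because the pseudo references are themselves imperfect, the point of training (rather than using BLEU-style matching directly) is that the learner can weight agreement with each MT system according to how informative that agreement actually is about human judgments.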