Ranking vs. regression in machine translation evaluation

Authors:
Kevin Duh
Affiliations:
University of Washington, Seattle, WA
Venue:
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
Year:
2008

Abstract

Automatic evaluation of machine translation (MT) systems is an important research topic for the advancement of MT technology. Most automatic evaluation methods proposed to date are score-based: they compute scores that represent translation quality, and MT systems are compared on the basis of these scores. We advocate an alternative perspective of automatic MT evaluation based on ranking. Instead of producing scores, we directly produce a ranking over the set of MT systems to be compared. This perspective is often simpler when the evaluation goal is system comparison. We argue that it is easier to elicit human judgments of ranking, and we develop a machine learning approach to train on rank data. We compare this ranking method to a score-based regression method on WMT07 data. Results indicate that ranking achieves higher correlation with human judgments, especially in cases where ranking-specific features are used.
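To make the ranking perspective concrete, the following is a minimal sketch of learning from rank data via pairwise preferences. It is not the paper's implementation: the abstract does not name the learner or the features, so a simple perceptron stands in for the learner, and the two feature dimensions and all toy values are hypothetical. A score-based regression method would instead fit a function to absolute human scores; here only the relative order of candidates for the same source sentence is used.

```python
from itertools import combinations

def pairwise_differences(groups):
    """Convert ranked candidates into preference examples.

    groups: one list per source sentence, each holding (features, human_rank)
    pairs, where a lower rank means a better translation.
    Returns difference vectors (better_features - worse_features).
    """
    diffs = []
    for candidates in groups:
        for (xa, ra), (xb, rb) in combinations(candidates, 2):
            if ra == rb:
                continue  # a tie expresses no preference
            better, worse = (xa, xb) if ra < rb else (xb, xa)
            diffs.append([p - q for p, q in zip(better, worse)])
    return diffs

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise_perceptron(diffs, dim, epochs=50):
    """Find w so that w . (better - worse) > 0 for the training pairs.

    Assumption: a perceptron is used only as a simple stand-in learner.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        mistakes = 0
        for d in diffs:
            if dot(w, d) <= 0:  # pair is ordered incorrectly; update
                w = [wi + di for wi, di in zip(w, d)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def rank_candidates(w, feature_vectors):
    """Return candidate indices ordered from best to worst."""
    scores = [dot(w, x) for x in feature_vectors]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Hypothetical toy data: two source sentences, each with system outputs
# described by two made-up features and a human rank (1 = best).
train_groups = [
    [([0.9, 0.2], 1), ([0.5, 0.4], 2), ([0.1, 0.9], 3)],
    [([0.8, 0.1], 1), ([0.3, 0.3], 2)],
]

w = train_pairwise_perceptron(pairwise_differences(train_groups), dim=2)

# Rank three unseen system outputs for a new source sentence.
new_candidates = [[0.2, 0.8], [0.7, 0.3], [0.4, 0.5]]
order = rank_candidates(w, new_candidates)
print(order)
```

The learned weights are only ever used to order candidates within one comparison set, which is the point of the ranking view: the model never has to predict a calibrated quality score.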