Putting human assessments of machine translation systems in order

  • Authors: Adam Lopez

  • Affiliations: Johns Hopkins University

  • Venue: WMT '12 Proceedings of the Seventh Workshop on Statistical Machine Translation

  • Year: 2012

Abstract

Human assessment is often considered the gold standard in evaluation of translation systems. But in order for the evaluation to be meaningful, the rankings obtained from human assessment must be consistent and repeatable. Recent analysis by Bojar et al. (2011) raised several concerns about the rankings derived from human assessments of English-Czech translation systems in the 2010 Workshop on Machine Translation. We extend their analysis to all of the ranking tasks from 2010 and 2011, and show through an extension of their reasoning that the ranking is naturally cast as an instance of finding the minimum feedback arc set in a tournament, a well-known NP-complete problem. All instances of this problem in the workshop data are efficiently solvable, but in some cases the rankings obtained by solving it exactly are surprisingly different from the ones previously published. This leads to strong caveats and recommendations for both producers and consumers of these rankings.
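
To make the formulation concrete, here is a minimal sketch of the idea the abstract describes: rank a handful of systems by searching for the ordering that contradicts the fewest pairwise human judgments. When each pair's arc is drawn by majority preference, this is the minimum feedback arc set problem in a tournament. The system names and judgment counts below are illustrative assumptions, not WMT data, and the paper's exact construction of the tournament may differ.

```python
from itertools import permutations

# Hypothetical pairwise judgment counts: wins[(a, b)] = number of human
# judgments preferring system a over system b. Purely illustrative data.
systems = ["sysA", "sysB", "sysC", "sysD"]
wins = {
    ("sysA", "sysB"): 7, ("sysB", "sysA"): 3,
    ("sysA", "sysC"): 4, ("sysC", "sysA"): 6,
    ("sysA", "sysD"): 8, ("sysD", "sysA"): 2,
    ("sysB", "sysC"): 5, ("sysC", "sysB"): 5,
    ("sysB", "sysD"): 6, ("sysD", "sysB"): 4,
    ("sysC", "sysD"): 9, ("sysD", "sysC"): 1,
}

def violations(order):
    """Count judgments that disagree with the proposed ranking: for every
    pair placed as (higher, lower), the judgments preferring the
    lower-ranked system correspond to feedback arcs."""
    total = 0
    for i, hi in enumerate(order):
        for lo in order[i + 1:]:
            total += wins.get((lo, hi), 0)
    return total

# Exhaustive search over all orderings: exact, and fast enough for the
# small number of systems in a shared-task ranking, even though the
# general problem is NP-complete.
best = min(permutations(systems), key=violations)
print("ranking:", " > ".join(best), "| violated judgments:", violations(best))
```

Brute force over permutations is only viable because shared-task rankings involve few systems; for larger instances one would turn to exact integer-programming or approximation methods for minimum feedback arc set.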