A grain of salt for the WMT manual evaluation

  • Authors:
  • Ondřej Bojar; Miloš Ercegovčević; Martin Popel; Omar F. Zaidan

  • Affiliations:
  • Charles University in Prague, Institute of Formal and Applied Linguistics (Bojar, Ercegovčević, Popel); Johns Hopkins University (Zaidan)

  • Venue:
  • WMT '11 Proceedings of the Sixth Workshop on Statistical Machine Translation
  • Year:
  • 2011

Abstract

The Workshop on Statistical Machine Translation (WMT) has become one of ACL's flagship workshops, held annually since 2006. In addition to soliciting papers from the research community, WMT also features a shared translation task for evaluating MT systems. This shared task is notable for having manual evaluation as its cornerstone. The Workshop's overview paper, playing a descriptive and administrative role, reports the main results of the evaluation without delving deeply into analyzing those results. The aim of this paper is to investigate and explain some interesting idiosyncrasies in the reported results, which only become apparent when performing a more thorough analysis of the collected annotations. Our analysis sheds some light on how the reported results should (and should not) be interpreted, and also gives rise to some helpful recommendations for the organizers of WMT.