Predicting human-targeted translation edit rate via untrained human annotators

Authors:
Omar F. Zaidan;Chris Callison-Burch
Affiliations:
Johns Hopkins University, Baltimore, MD;Johns Hopkins University, Baltimore, MD
Venue:
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Year:
2010

Citing 2
Cited 2

Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks

EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1

They can help: using crowdsourcing to improve the evaluation of grammatical error detection systems

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Bucking the trend: improved evaluation and annotation practices for ESL error detection systems

Language Resources and Evaluation

Quantified Score

Hi-index	0.00

Visualization

Abstract

In the field of machine translation, automatic metrics have proven quite valuable in system development for tracking progress and measuring the impact of incremental changes. However, human judgment still plays a large role in the context of evaluating MT systems. For example, the GALE project uses human-targeted translation edit rate (HTER), wherein the MT output is scored against a post-edited version of itself (as opposed to being scored against an existing human reference). This poses a problem for MT researchers, since HTER is not an easy metric to calculate, and would require hiring and training human an-notators to perform the editing task. In this work, we explore soliciting those edits from untrained human annotators, via the online service Amazon Mechanical Turk. We show that the collected data allows us to predict HTER-ranking of documents at a significantly higher level than the ranking obtained using automatic metrics.