Fast, cheap, and creative: evaluating translation quality using Amazon's Mechanical Turk

  • Authors: Chris Callison-Burch
  • Affiliations: Johns Hopkins University, Baltimore, Maryland
  • Venue: EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Volume 1
  • Year: 2009

Abstract

Manual evaluation of translation quality is generally thought to be excessively time consuming and expensive. We explore a fast and inexpensive way of doing it using Amazon's Mechanical Turk to pay small sums to a large number of non-expert annotators. For $10 we redundantly recreate judgments from a WMT08 translation task. We find that, when combined, non-expert judgments have a high level of agreement with the existing gold-standard judgments of machine translation quality, and correlate more strongly with expert judgments than Bleu does. We go on to show that Mechanical Turk can be used to calculate human-mediated translation edit rate (HTER), to conduct reading comprehension experiments with machine translation, and to create high-quality reference translations.
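
The abstract describes combining redundant non-expert judgments and comparing them against expert gold-standard labels. As a minimal illustrative sketch (not the paper's actual pipeline), one common way to combine such redundant judgments is a simple majority vote over the labels each item receives, followed by measuring agreement with the expert label. The data structures and label names below are hypothetical placeholders.

```python
from collections import Counter
from statistics import mean

# Hypothetical data: each key identifies a pairwise comparison of two system
# outputs on one source sentence; the values are the preference labels
# ("a", "b", or "tie") collected redundantly from non-expert annotators.
turker_labels = {
    ("sys1", "sys2", 17): ["a", "a", "b", "a", "tie"],
    ("sys1", "sys3", 17): ["b", "b", "b", "a", "b"],
}

# Corresponding single expert (gold-standard) judgment for each item.
expert_labels = {
    ("sys1", "sys2", 17): "a",
    ("sys1", "sys3", 17): "b",
}

def majority_vote(labels):
    """Combine redundant non-expert judgments by taking the most common label."""
    return Counter(labels).most_common(1)[0][0]

# Fraction of items on which the combined non-expert judgment matches the expert.
agreements = [
    majority_vote(votes) == expert_labels[key]
    for key, votes in turker_labels.items()
    if key in expert_labels
]
print(f"agreement with expert gold standard: {mean(agreements):.2f}")
```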