We provide evidence that intrinsic evaluation of summaries using Amazon's Mechanical Turk is quite difficult. Experiments mirroring evaluation at the Text Analysis Conference's summarization track show that non-expert judges are unable to recover the system rankings produced by expert judges.