How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multi-label image annotation

Authors:
Stefanie Nowak; Stefan Rüger
Affiliations:
Fraunhofer IDMT, Ilmenau, Germany; Open University, Milton Keynes, England, UK
Venue:
Proceedings of the international conference on Multimedia information retrieval
Year:
2010

Abstract

The creation of gold standard datasets is a costly business. Ideally, more than one judgment per document is obtained to ensure high-quality annotations. In this context, we explore how much annotations from experts differ from each other, how different sets of annotations influence the ranking of systems, and whether these annotations can be obtained with a crowdsourcing approach. This study is applied to annotations of images with multiple concepts. A subset of the images employed in the latest ImageCLEF Photo Annotation competition was manually annotated by expert annotators and by non-experts via Mechanical Turk. The inter-annotator agreement is computed at an image-based and a concept-based level using majority vote, accuracy and kappa statistics. Further, the Kendall τ and Kolmogorov-Smirnov correlation tests are used to compare the ranking of systems with respect to different ground truths and different evaluation measures in a benchmark scenario. Results show that while the agreement between experts and non-experts varies depending on the measure used, its influence on the ranked lists of the systems is rather small. In summary, the majority vote, applied to generate one annotation set out of several opinions, is able to filter noisy judgments of non-experts to some extent. The resulting annotation set is of comparable quality to the annotations of experts.
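The measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say which kappa variant was used, so Cohen's kappa for two annotators is assumed here, and Kendall τ is computed in its simple tau-a form without tie correction. All function names and the toy labels and scores are hypothetical and serve only as an illustration.

```python
from collections import Counter
from itertools import combinations


def majority_vote(labels):
    """Merge several judgments for one item into a single label.

    Ties go to the label seen first (an arbitrary choice for this sketch).
    """
    return Counter(labels).most_common(1)[0][0]


def accuracy(a, b):
    """Fraction of items on which two label lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: agreement of two annotators corrected for chance."""
    n = len(a)
    observed = accuracy(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def kendall_tau(x, y):
    """Kendall tau-a between two score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs


# Toy data (invented): three non-expert votes per image for one concept.
crowd_votes = [[1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
expert = [1, 0, 0, 0]

merged = [majority_vote(v) for v in crowd_votes]
agreement = accuracy(merged, expert)
kappa = cohen_kappa(merged, expert)

# Toy scores (invented) of four systems under two different ground truths.
scores_expert_gt = [0.80, 0.70, 0.60, 0.50]
scores_crowd_gt = [0.78, 0.72, 0.55, 0.58]
tau = kendall_tau(scores_expert_gt, scores_crowd_gt)

print(merged, agreement, kappa, round(tau, 3))
```

In this toy setting the merged crowd labels disagree with the expert on one of four images, and the two ground truths swap the order of only the last two systems, which mirrors the kind of comparison the abstract describes: moderate label disagreement alongside a largely preserved system ranking.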