The creation of gold standard datasets is a costly business. Ideally, more than one judgment per document is obtained to ensure high annotation quality. In this context, we explore how much annotations from experts differ from each other, how different sets of annotations influence the ranking of systems, and whether such annotations can be obtained with a crowdsourcing approach. The study is applied to annotations of images with multiple concepts. A subset of the images used in the latest ImageCLEF Photo Annotation competition was manually annotated by expert annotators and by non-experts via Amazon Mechanical Turk. Inter-annotator agreement is computed at the image level and at the concept level using majority vote, accuracy, and kappa statistics. Furthermore, the Kendall τ and the Kolmogorov-Smirnov correlation tests are used to compare the rankings of systems obtained with different ground truths and different evaluation measures in a benchmark scenario. Results show that, while the agreement between experts and non-experts varies depending on the measure used, its influence on the ranked lists of systems is rather small. In summary, majority voting applied to merge several opinions into a single annotation set is able to filter out noisy judgments of non-experts to some extent, and the resulting annotation set is of comparable quality to the expert annotations.
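As a rough illustration of the aggregation and agreement measures named above (majority vote and kappa statistics), the following Python sketch shows one way such quantities could be computed. The helper names, the toy judgments, and the tie-breaking rule are assumptions made for illustration, not the evaluation code used in the study.

```python
# Illustrative sketch only (assumed helper names and toy data, not the study's code):
# collapse several noisy binary judgments per image-concept pair into a single
# label by majority vote, then measure agreement with an expert set via Cohen's kappa.
from collections import Counter

def majority_vote(judgments):
    """Return the label chosen by most annotators; ties go to the positive class."""
    counts = Counter(judgments)
    return 1 if counts[1] >= counts[0] else 0

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two binary annotation sets."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    pos_a = sum(labels_a) / n                 # rate of positive labels in set A
    pos_b = sum(labels_b) / n                 # rate of positive labels in set B
    expected = pos_a * pos_b + (1 - pos_a) * (1 - pos_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Toy example: three non-expert judgments per image for one concept.
turker_judgments = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 1, 0], [1, 0, 1]]
crowd_labels = [majority_vote(j) for j in turker_judgments]
expert_labels = [1, 0, 1, 0, 1]               # hypothetical expert annotations
print(crowd_labels, cohens_kappa(expert_labels, crowd_labels))
```

System rankings produced under different ground truths could then be compared with a rank correlation such as Kendall τ (for instance via scipy.stats.kendalltau), which is the kind of ranking comparison the abstract refers to.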