Crowdsourcing relevance judgments for the evaluation of search engines is increasingly used to overcome the scalability issues that hinder traditional approaches, which rely on a fixed group of trusted expert judges. However, the benefits of crowdsourcing come with risks due to the engagement of a self-forming group of individuals, the crowd, who are motivated by different incentives and complete the tasks with varying levels of attention and success. This increases the need for careful design of crowdsourcing tasks that attracts the right crowd for a given task and promotes quality work. In this paper, we describe a series of experiments conducted on Amazon's Mechanical Turk to explore the 'human' characteristics of the crowds involved in a relevance assessment task. In the experiments, we vary the level of pay offered, the effort required to complete a task, and the qualifications required of the workers. We observe the effects of these variables on the quality of the resulting relevance labels, measured by agreement with a gold set, and correlate them with self-reported measures of various human factors. We elicit information from the workers about their motivations, interest in and familiarity with the topic, perceived task difficulty, and satisfaction with the offered pay. We investigate how these factors combine with aspects of the task design and how they affect the accuracy of the resulting relevance labels. Based on the analysis of 960 HITs and 2,880 HIT assignments, yielding 19,200 relevance labels, we arrive at insights into the complex interaction of the observed factors and provide practical guidelines for crowdsourcing practitioners. In addition, we highlight challenges in the data analysis that stem from a peculiarity of the crowdsourcing environment: the sample of individuals engaged under specific work conditions is inherently influenced by those conditions.
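
To make the quality measure concrete, the following is a minimal sketch (in Python, using hypothetical field names such as 'condition', 'doc_id', and 'label'; it is an illustration of agreement with a gold set broken down by experimental condition, not the paper's actual analysis code):

    from collections import defaultdict

    def accuracy_by_condition(labels, gold):
        """Fraction of crowd labels that match the gold label, per condition.

        labels: iterable of dicts with keys 'condition', 'doc_id', 'label'
                (hypothetical field names for worker-supplied relevance labels).
        gold:   dict mapping doc_id -> gold relevance label.
        """
        correct = defaultdict(int)
        total = defaultdict(int)
        for row in labels:
            doc_id = row['doc_id']
            if doc_id not in gold:  # score only documents with gold labels
                continue
            total[row['condition']] += 1
            if row['label'] == gold[doc_id]:
                correct[row['condition']] += 1
        return {c: correct[c] / total[c] for c in total if total[c]}

    # Example: two pay conditions, binary relevance labels.
    gold = {'d1': 1, 'd2': 0}
    labels = [
        {'condition': 'low_pay',  'doc_id': 'd1', 'label': 1},
        {'condition': 'low_pay',  'doc_id': 'd2', 'label': 1},
        {'condition': 'high_pay', 'doc_id': 'd1', 'label': 1},
        {'condition': 'high_pay', 'doc_id': 'd2', 'label': 0},
    ]
    print(accuracy_by_condition(labels, gold))  # {'low_pay': 0.5, 'high_pay': 1.0}

The same per-condition accuracies could then be correlated with the self-reported human factors (motivation, interest, familiarity, perceived difficulty, pay satisfaction) described above.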