The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely on editorial relevance judgments. Increasingly, crowdsourcing of relevance labels has been regarded as a viable alternative that scales at modest cost. However, crowdsourcing suffers from undesirable worker practices and low-quality contributions. In this paper we investigate the design and implementation of effective crowdsourcing tasks in the context of book search evaluation. We observe the impact of aspects of the Human Intelligence Task (HIT) design on the quality of the relevance labels provided by the crowd. We assess the output in terms of label agreement with a gold standard data set and study the effect of the crowdsourced relevance judgments on the resulting system rankings. This allows us to observe the influence of crowdsourcing on the entire IR evaluation process. Using the test set and experimental runs of the INEX 2010 Book Track, we find that varying the HIT design, as well as the pooling and document ordering strategies, leads to considerable differences in agreement with the gold-set labels. We then examine the impact of the crowdsourced relevance label sets on the relative system rankings using four IR performance metrics. System rankings based on MAP and Bpref are less affected by the different label sets, while Precision@10 and nDCG@10 lead to dramatically different system rankings, especially for labels acquired from HITs with weaker quality controls. Overall, we find that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
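To make the evaluation pipeline described above concrete, the sketch below shows one way such an analysis could be set up: crowdsourced labels are compared against a gold standard using Cohen's kappa, and the two label sets are then compared on the system ranking they induce via Kendall's tau over a Precision@10-style score. This is only an illustration under assumed inputs, not the paper's actual code; the document IDs, system names, runs, and labels are hypothetical toy data, and the paper's analysis also covers MAP, Bpref, and nDCG@10.

```python
# Minimal illustrative sketch (not the authors' evaluation code).
from collections import Counter
from scipy.stats import kendalltau


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label lists."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[v] / n) * (freq_b[v] / n)
                   for v in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)


def precision_at_10(ranked_docs, qrels):
    """Fraction of the top 10 retrieved documents labelled relevant in qrels."""
    return sum(qrels.get(d, 0) for d in ranked_docs[:10]) / 10


def score_systems(runs, qrels):
    """Map each system name to its Precision@10 for one ranked list per system."""
    return {name: precision_at_10(ranked, qrels) for name, ranked in runs.items()}


# Hypothetical toy data: binary relevance per document under each label set.
gold = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1, "d6": 0}
crowd = {"d1": 1, "d2": 1, "d3": 1, "d4": 0, "d5": 0, "d6": 0}

# Hypothetical ranked result lists for three systems on a single topic.
runs = {
    "sysA": ["d1", "d3", "d5"],
    "sysB": ["d2", "d1", "d4"],
    "sysC": ["d4", "d6", "d2"],
}

docs = sorted(gold)
kappa = cohen_kappa([gold[d] for d in docs], [crowd[d] for d in docs])

systems = sorted(runs)
gold_scores = score_systems(runs, gold)
crowd_scores = score_systems(runs, crowd)
tau, _ = kendalltau([gold_scores[s] for s in systems],
                    [crowd_scores[s] for s in systems])

print(f"label agreement (kappa) = {kappa:.2f}")
print(f"system ranking correlation (tau) = {tau:.2f}")
```

In this setup, a low kappa combined with a high tau would mirror the abstract's observation that noisier crowdsourced labels can still preserve relative system orderings under some metrics, while a low tau would signal the kind of ranking instability reported for Precision@10 and nDCG@10.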