The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely on editorial relevance judgments. Increasingly, crowdsourcing of relevance labels has been regarded as a viable alternative that scales at modest cost. However, crowdsourcing suffers from undesirable worker practices and low-quality contributions. In this paper we investigate the design and implementation of effective crowdsourcing tasks in the context of book search evaluation. We observe the impact of aspects of the Human Intelligence Task (HIT) design on the quality of the relevance labels provided by the crowd. We assess the output in terms of label agreement with a gold standard data set and study the effect of the crowdsourced relevance judgments on the resulting system rankings. This allows us to observe the influence of crowdsourcing on the entire IR evaluation process. Using the test set and experimental runs of the INEX 2010 Book Track, we find that varying the HIT design, as well as the pooling and document ordering strategies, leads to considerable differences in agreement with the gold-set labels. We then examine the impact of the crowdsourced relevance label sets on the relative system rankings using four IR performance metrics. System rankings based on MAP and Bpref are less affected by the different label sets, while Precision@10 and nDCG@10 lead to dramatically different system rankings, especially for labels acquired from HITs with weaker quality controls. Overall, we find that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
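To make the evaluation pipeline described above concrete, the sketch below shows one way such an analysis could be set up: crowdsourced labels are compared against a gold standard using Cohen's kappa, and the two label sets are then compared on the system ranking they induce via Kendall's tau over a Precision@10-style score. This is only an illustration under assumed inputs, not the paper's actual code; the document IDs, system names, runs, and labels are hypothetical toy data, and the paper's analysis also covers MAP, Bpref, and nDCG@10.

```python
# Minimal illustrative sketch (not the authors' evaluation code).
from collections import Counter
from scipy.stats import kendalltau


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equal-length label lists."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[v] / n) * (freq_b[v] / n)
                   for v in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)


def precision_at_10(ranked_docs, qrels):
    """Fraction of the top 10 retrieved documents labelled relevant in qrels."""
    return sum(qrels.get(d, 0) for d in ranked_docs[:10]) / 10


def score_systems(runs, qrels):
    """Map each system name to its Precision@10 for one ranked list per system."""
    return {name: precision_at_10(ranked, qrels) for name, ranked in runs.items()}


# Hypothetical toy data: binary relevance per document under each label set.
gold = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1, "d6": 0}
crowd = {"d1": 1, "d2": 1, "d3": 1, "d4": 0, "d5": 0, "d6": 0}

# Hypothetical ranked result lists for three systems on a single topic.
runs = {
    "sysA": ["d1", "d3", "d5"],
    "sysB": ["d2", "d1", "d4"],
    "sysC": ["d4", "d6", "d2"],
}

docs = sorted(gold)
kappa = cohen_kappa([gold[d] for d in docs], [crowd[d] for d in docs])

systems = sorted(runs)
gold_scores = score_systems(runs, gold)
crowd_scores = score_systems(runs, crowd)
tau, _ = kendalltau([gold_scores[s] for s in systems],
                    [crowd_scores[s] for s in systems])

print(f"label agreement (kappa) = {kappa:.2f}")
print(f"system ranking correlation (tau) = {tau:.2f}")
```

In this setup, a low kappa combined with a high tau would mirror the abstract's observation that noisier crowdsourced labels can still preserve relative system orderings under some metrics, while a low tau would signal the kind of ranking instability reported for Precision@10 and nDCG@10.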