We consider the problem of acquiring relevance judgments for information retrieval (IR) test collections through crowdsourcing when no true relevance labels are available. We collect multiple, possibly noisy, relevance labels per document from workers of unknown labeling accuracy and use these labels to infer document relevance with two methods. The first is the commonly used majority voting (MV), which determines document relevance from the label that receives the most votes, treating all workers equally. The second is a probabilistic model that jointly estimates document relevance and worker accuracy using expectation maximization (EM). We run simulations and conduct experiments with crowdsourced relevance labels from the INEX 2010 Book Search track to investigate the accuracy of the relevance assessments and their robustness to noisy labels, and we observe the effect of the derived relevance judgments on the ranking of the search systems. Our experimental results show that the EM method outperforms the MV method in both the accuracy of the relevance assessments and the resulting ranking of IR systems. The performance improvements are especially noticeable when the number of labels per document is small and the labels are of varied quality.
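The abstract describes the two aggregation methods only at a high level. As an illustrative sketch, not the authors' implementation, a majority-vote baseline and a Dawid & Skene-style EM estimator for binary relevance could look roughly as follows in Python; the function names, the data layout, and the pseudo-count smoothing are assumptions made here for the example.

```python
def majority_vote(labels):
    # labels: dict mapping doc_id -> list of (worker_id, 0/1) votes.
    # Ties are broken in favour of "relevant".
    return {d: int(2 * sum(l for _, l in votes) >= len(votes))
            for d, votes in labels.items()}

def em_relevance(labels, n_iter=50):
    """Dawid & Skene-style EM for binary relevance labels.

    Jointly estimates P(document is relevant) and each worker's
    sensitivity/specificity from the noisy votes alone.
    Returns (posterior per document, (sensitivity, specificity) per worker).
    """
    # Initialise document posteriors with a smoothed majority vote.
    post = {d: (1.0 + sum(l for _, l in v)) / (2.0 + len(v))
            for d, v in labels.items()}
    workers = {w for v in labels.values() for w, _ in v}
    sens = {w: 0.5 for w in workers}   # P(worker labels 1 | doc relevant)
    spec = {w: 0.5 for w in workers}   # P(worker labels 0 | doc not relevant)
    for _ in range(n_iter):
        # M-step: re-estimate the relevance prior and worker accuracies,
        # with pseudo-counts so parameters stay strictly inside (0, 1).
        pi = sum(post.values()) / len(post)
        s_num = {w: 1.0 for w in workers}; s_den = {w: 2.0 for w in workers}
        t_num = {w: 1.0 for w in workers}; t_den = {w: 2.0 for w in workers}
        for d, votes in labels.items():
            for w, l in votes:
                s_num[w] += post[d] * l
                s_den[w] += post[d]
                t_num[w] += (1.0 - post[d]) * (1 - l)
                t_den[w] += 1.0 - post[d]
        sens = {w: s_num[w] / s_den[w] for w in workers}
        spec = {w: t_num[w] / t_den[w] for w in workers}
        # E-step: recompute each document's posterior probability of relevance.
        for d, votes in labels.items():
            p1, p0 = pi, 1.0 - pi
            for w, l in votes:
                p1 *= sens[w] if l else 1.0 - sens[w]
                p0 *= 1.0 - spec[w] if l else spec[w]
            post[d] = p1 / (p1 + p0)
    return post, (sens, spec)

# Hypothetical usage on a toy label set:
labels = {"doc1": [("w1", 1), ("w2", 1), ("w3", 0)],
          "doc2": [("w1", 0), ("w3", 0)]}
mv_judgments = majority_vote(labels)
post, _ = em_relevance(labels)
em_judgments = {d: int(p >= 0.5) for d, p in post.items()}
```

The pseudo-counts in the M-step are only a convenience to keep worker parameters away from 0 and 1 on sparse data; the paper's actual model and its initialisation may differ.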