Test collections are powerful mechanisms for the evaluation and optimization of information retrieval systems. However, there is reported evidence that experiment outcomes can be affected by changes to the judging guidelines or changes in the judge population. This paper examines such effects in a web search setting, comparing the judgments of four groups of judges: NIST Web Track judges, untrained crowd workers, and two groups of trained judges employed by a commercial search engine. Our goal is to identify systematic judging errors by comparing the labels contributed by the different groups, working under the same or different judging guidelines. In particular, we focus on detecting systematic differences in judging that depend on specific characteristics of the queries and URLs. For example, we ask whether a given population of judges, working under a given set of judging guidelines, is more likely to consistently overrate Wikipedia pages than another group judging under the same instructions. Our approach is to identify judging errors with respect to a consensus set, a judged gold set, and a set of user clicks. We further demonstrate how such biases can affect the training of retrieval systems.
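To make the consensus-based comparison concrete, the sketch below shows one way such an analysis could be set up: compute a majority-vote consensus label per (query, URL) pair across all judge groups, then report each group's rate of over- and under-rating relative to that consensus, split by a page property such as whether the URL is a Wikipedia page. This is a minimal illustration under assumed data, not the paper's actual pipeline; the record layout, group names, and the `is_wikipedia` flag are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records: one relevance label per (query, url) from each judge group.
# label: 0 = non-relevant, 1 = relevant; is_wikipedia flags the page property of interest.
judgments = [
    {"query": "q1", "url": "u1", "group": "nist", "label": 1, "is_wikipedia": True},
    {"query": "q1", "url": "u1", "group": "crowd", "label": 1, "is_wikipedia": True},
    {"query": "q1", "url": "u1", "group": "trained_a", "label": 0, "is_wikipedia": True},
    # ... more (query, url) pairs and judge groups ...
]

# Step 1: majority-vote consensus label per (query, url) across all groups.
labels_by_doc = defaultdict(list)
for j in judgments:
    labels_by_doc[(j["query"], j["url"])].append(j["label"])
consensus = {doc: Counter(labels).most_common(1)[0][0]
             for doc, labels in labels_by_doc.items()}

# Step 2: per-group over-/under-rating rates relative to the consensus,
# broken down by the Wikipedia flag.
stats = defaultdict(Counter)
for j in judgments:
    c = consensus[(j["query"], j["url"])]
    key = (j["group"], j["is_wikipedia"])
    stats[key]["total"] += 1
    if j["label"] > c:
        stats[key]["overrated"] += 1
    elif j["label"] < c:
        stats[key]["underrated"] += 1

for (group, is_wiki), counts in sorted(stats.items()):
    total = counts["total"]
    print(f"{group:>10s} wikipedia={is_wiki}: "
          f"overrate={counts['overrated'] / total:.2f}, "
          f"underrate={counts['underrated'] / total:.2f}")
```

A systematic bias of the kind discussed in the abstract would show up here as one group's over-rate on Wikipedia pages being consistently higher than that of another group working under the same guidelines; the same breakdown could be repeated against a judged gold set or click-derived labels instead of the consensus.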