Recent efforts in test collection building have focused on scaling back the number of relevance judgments required per topic while scaling up the number of search topics. Since the largest source of variation in a Cranfield-style experiment comes from the topics, this is a reasonable approach. However, as topic set sizes grow and researchers turn to crowdsourcing platforms such as Amazon Mechanical Turk to collect relevance judgments, quality control becomes a concern. This paper examines the robustness of the TREC Million Query track methods when some assessors make significant and systematic errors. We find that while average effectiveness scores are robust, assessor errors can have a large effect on system rankings.
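To make the kind of robustness analysis described above concrete, the sketch below simulates systematic assessor error by flipping a fraction of relevance labels and then measures how much the induced system ranking changes, using Kendall's tau. This is only an illustrative outline, not the authors' actual procedure: the qrels representation, the `evaluate` scoring function, and the flip rate are all hypothetical placeholders.

```python
# Illustrative sketch (not the paper's code): perturb relevance judgments to
# mimic a systematically erring assessor, then compare the system ranking
# under original vs. perturbed judgments via Kendall's tau.
import random
from scipy.stats import kendalltau

def perturb_qrels(qrels, flip_rate=0.2, seed=0):
    """Flip a fraction of binary relevance labels to simulate assessor error.

    qrels: dict mapping (topic_id, doc_id) -> 0/1 relevance label (assumed format).
    """
    rng = random.Random(seed)
    return {key: (1 - rel) if rng.random() < flip_rate else rel
            for key, rel in qrels.items()}

def ranking_robustness(runs, qrels, evaluate, flip_rate=0.2):
    """Kendall's tau between system rankings before and after perturbation.

    runs: list of system runs; evaluate(run, qrels) -> effectiveness score
    (e.g., MAP) is assumed to be supplied by the caller.
    """
    noisy_qrels = perturb_qrels(qrels, flip_rate)
    original = [evaluate(run, qrels) for run in runs]
    perturbed = [evaluate(run, noisy_qrels) for run in runs]
    tau, _ = kendalltau(original, perturbed)
    return tau
```

A tau near 1.0 would indicate that the ranking of systems is stable despite the injected errors; lower values would signal the kind of ranking sensitivity the abstract reports, even when average scores remain largely unchanged.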