Alternative assessor disagreement and retrieval depth

Authors:
William Webber;Praveen Chandar;Ben Carterette
Affiliations:
University of Maryland, College Park, MD, USA;University of Delaware, Newark, DE, USA;University of Delaware, Newark, DE, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

Assessors are well known to disagree frequently on the relevance of documents to a topic, but the factors leading to assessor disagreement are still poorly understood. In this paper, we examine the relationship between the rank at which a document is returned by a set of retrieval systems and the likelihood that a second assessor will disagree with the relevance assessment of the initial assessor, and find that there is a strong and consistent correlation between the two. We adopt a metarank method of summarizing a document's rank across multiple runs, and propose a logistic regression predictive model of second assessor disagreement given metarank and initially-assessed relevance. The consistency of the model parameters across different topics, assessor pairs, and collections is considered. The model gives comparatively accurate predictions of absolute system scores, but less consistent predictions of relative scores than a simpler rank-insensitive model. We demonstrate that the logistic regression model is robust to using sampled, rather than exhaustive, dual assessment. We demonstrate the use of the sampled predictive model to incorporate assessor disagreement into tests of statistical significance.
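
As an illustration of the kind of model the abstract describes, the following is a minimal sketch of a logistic regression that predicts the probability of second-assessor disagreement from a document's metarank and its initially-assessed relevance. The abstract does not give the functional form or any fitted values, so everything specific here is an assumption: the use of log metarank, the interaction term, the function name `disagreement_probability`, and all coefficient values are hypothetical and chosen only to make the example run.

```python
import math

# Hypothetical coefficients for illustration only; they are NOT taken
# from the paper, which reports no values in its abstract.
INTERCEPT = -2.0        # baseline log-odds of disagreement
B_LOG_METARANK = -0.3   # slope on log(metarank) for initially irrelevant docs
B_RELEVANT = 0.5        # shift when the first assessor judged the doc relevant
B_INTERACTION = 0.8     # extra slope on log(metarank) for initially relevant docs


def disagreement_probability(metarank, initially_relevant):
    """Predicted probability that a second assessor disagrees.

    metarank: a positive summary of the document's rank across runs
              (assumed here to be on a rank-like scale, 1 = best).
    initially_relevant: the first assessor's binary judgment.
    """
    if metarank <= 0:
        raise ValueError("metarank must be positive")
    x = math.log(metarank)
    r = 1.0 if initially_relevant else 0.0
    # Linear predictor with an interaction, so the rank effect can differ
    # between initially relevant and initially irrelevant documents.
    z = INTERCEPT + B_LOG_METARANK * x + B_RELEVANT * r + B_INTERACTION * x * r
    # Logistic link maps the log-odds to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def expected_second_relevance(metarank, initially_relevant):
    """Probability that the second assessor judges the document relevant."""
    p = disagreement_probability(metarank, initially_relevant)
    # Disagreement flips the initial label; agreement keeps it.
    return 1.0 - p if initially_relevant else p


if __name__ == "__main__":
    for rank in (1, 10, 100):
        for rel in (True, False):
            print(rank, rel, round(disagreement_probability(rank, rel), 3))
```

Under these assumed coefficients, predicted disagreement rises with metarank for documents first judged relevant and falls for documents first judged irrelevant. In practice the coefficients would be estimated from dual-assessed documents, which the abstract reports can be a sample rather than an exhaustive set.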