Variations in relevance judgments and the evaluation of retrieval performance
Information Processing and Management: an International Journal
Assessing agreement on classification tasks: the kappa statistic
Computational Linguistics
Variations in relevance assessments and the measurement of retrieval effectiveness
Journal of the American Society for Information Science - Special issue: evaluation of information retrieval systems
Efficient construction of large test collections
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Variations in relevance judgments and the measurement of retrieval effectiveness
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
IR evaluation methods for retrieving highly relevant documents
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Liberal relevance criteria of TREC: counting on negligible documents?
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
The kappa statistic: a second look
Computational Linguistics
Retrieval evaluation with incomplete information
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
A statistical method for system evaluation using incomplete judgments
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating average precision with incomplete and imperfect judgments
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
A comparison of pooled and sampled relevance judgments
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
The CSIRO enterprise search test collection
ACM SIGIR Forum
A simple and efficient sampling method for estimating AP and NDCG
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Inter-coder agreement for computational linguistics
Computational Linguistics
Measuring the agreement among relevance judges
MIRA '99 Proceedings of the 1999 international conference on Final MIRA
Survey and evaluation of query intent detection methods
Proceedings of the 2009 workshop on Web Search Click Data
Towards methods for the collective gathering and quality control of relevance assessments
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Methods for Evaluating Interactive Information Retrieval Systems with Users
Foundations and Trends in Information Retrieval
Weighted Rank Correlation in Information Retrieval Evaluation
AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Improving quality of training data for learning to rank using click-through data
Proceedings of the third ACM international conference on Web search and data mining
Automated opinion detection: Implications of the level of agreement between human raters
Information Processing and Management: an International Journal
The effect of assessor error on IR system evaluation
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Extending average precision to graded relevance judgments
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Comparing the sensitivity of information retrieval metrics
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Evaluating search systems using result page context
Proceedings of the third symposium on Information interaction in context
Web search solved?: all result rankings the same?
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A user study of relevance judgments for e-discovery
Proceedings of the 73rd ASIS&T Annual Meeting on Navigating Streams in an Information Ecosystem - Volume 47
Topic Distillation with Query-Dependent Link Connections and Page Characteristics
ACM Transactions on the Web (TWEB)
Evaluating new search engine configurations with pre-existing judgments and clicks
Proceedings of the 20th international conference on World wide web
In search of quality in crowdsourcing for search engine evaluation
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
Let's agree to disagree: on the evaluation of vocabulary alignment
Proceedings of the sixth international conference on Knowledge capture
Repeatable and reliable search system evaluation using crowdsourcing
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Quantifying test collection quality based on the consistency of relevance judgements
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
The effects of choice in routing relevance judgments
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Selecting optimal training data for learning to rank
Information Processing and Management: an International Journal
A probabilistic method for inferring preferences from clicks
Proceedings of the 20th ACM international conference on Information and knowledge management
A nugget-based test collection construction paradigm
Proceedings of the 20th ACM international conference on Information and knowledge management
IR system evaluation using nugget-based test collections
Proceedings of the fifth ACM international conference on Web search and data mining
A noise-tolerant graphical model for ranking
Information Processing and Management: an International Journal
Using anchor text for homepage and topic distillation search tasks
Journal of the American Society for Information Science and Technology
On aggregating labels from multiple crowd workers to infer relevance of documents
ECIR'12 Proceedings of the 34th European conference on Advances in Information Retrieval
An IR-based evaluation framework for web search query segmentation
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Impact of assessor disagreement on ranking performance
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
On judgments obtained from a commercial search engine
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Using crowdsourcing for TREC relevance assessment
Information Processing and Management: an International Journal
An analysis of systematic judging errors in information retrieval
Proceedings of the 21st ACM international conference on Information and knowledge management
Alternative assessor disagreement and retrieval depth
Proceedings of the 21st ACM international conference on Information and knowledge management
Better than their reputation? on the reliability of relevance assessments with students
CLEF'12 Proceedings of the Third international conference on Information Access Evaluation: multilinguality, multimodality, and visual analytics
Differences in search engine evaluations between query owners and non-owners
Proceedings of the sixth ACM international conference on Web search and data mining
Optimizing nDCG gains by minimizing effect of label inconsistency
ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval
An analysis of human factors and label accuracy in crowdsourcing relevance judgments
Information Retrieval
Identifying top news using crowdsourcing
Information Retrieval
Document features predicting assessor disagreement
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Repeatable and reliable semantic search evaluation
Web Semantics: Science, Services and Agents on the World Wide Web
Choices in batch information retrieval evaluation
Proceedings of the 18th Australasian Document Computing Symposium
Merging algorithms for enterprise search
Proceedings of the 18th Australasian Document Computing Symposium
Exploiting user disagreement for web search evaluation: an experimental approach
Proceedings of the 7th ACM international conference on Web search and data mining
Evaluation in Music Information Retrieval
Journal of Intelligent Information Systems
We investigate to what extent people making relevance judgements for a reusable IR test collection are exchangeable. We consider three classes of judge: "gold standard" judges, who originated the topics and are experts in a particular information seeking task; "silver standard" judges, who are task experts but did not create topics; and "bronze standard" judges, who neither defined topics nor are experts in the task. Analysis shows low levels of agreement in relevance judgements between these three groups. We report on experiments to determine whether this disagreement is sufficient to invalidate the use of a test collection for measuring system performance when relevance assessments have been created by silver standard or bronze standard judges. We find that both system scores and system rankings are subject to consistent but small differences across the three assessment sets. It appears that test collections are not completely robust to changes of judge when the judges vary widely in task and topic expertise. Bronze standard judges may not be able to substitute for topic and task experts, because the relative performance of the assessed systems changes, and gold standard judges are preferred.
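As an illustration of the two quantities the abstract refers to, the sketch below computes Cohen's kappa between two assessors' relevance labels (inter-judge agreement) and Kendall's tau between the system rankings induced by two assessment sets (ranking stability). It is a minimal Python sketch on hypothetical data; the labels, MAP scores, and function names are illustrative assumptions, not figures or code from the paper.

# Minimal sketch: agreement between judges and stability of system rankings.
# All data below is hypothetical and only illustrates the computations.

from collections import Counter
from itertools import combinations

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors' relevance labels on the same documents."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def kendall_tau(scores_x, scores_y):
    """Kendall's tau-a between the system rankings induced by two score lists."""
    pairs = list(combinations(range(len(scores_x)), 2))
    concordant = sum(1 for i, j in pairs
                     if (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j]) > 0)
    discordant = sum(1 for i, j in pairs
                     if (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j]) < 0)
    return (concordant - discordant) / len(pairs)

# Hypothetical binary labels from a gold- and a bronze-standard judge.
gold = [1, 1, 0, 1, 0, 0, 1, 0]
bronze = [1, 0, 0, 1, 0, 1, 1, 0]
print("kappa(gold, bronze) =", round(cohens_kappa(gold, bronze), 3))

# Hypothetical MAP scores for five systems under each judge's assessments.
map_gold = [0.31, 0.28, 0.25, 0.22, 0.19]
map_bronze = [0.27, 0.29, 0.21, 0.23, 0.18]
print("tau(system rankings) =", round(kendall_tau(map_gold, map_bronze), 3))

On these toy inputs the kappa is 0.5 (moderate agreement) and the tau is 0.6, mirroring the abstract's observation that rankings shift somewhat, but not arbitrarily, when judges change.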