Variations in relevance judgments and the evaluation of retrieval performance
Information Processing and Management: an International Journal
Assessing agreement on classification tasks: the kappa statistic
Computational Linguistics
Variations in relevance assessments and the measurement of retrieval effectiveness
Journal of the American Society for Information Science - Special issue: evaluation of information retrieval systems
Efficient construction of large test collections
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Variations in relevance judgments and the measurement of retrieval effectiveness
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
IR evaluation methods for retrieving highly relevant documents
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Liberal relevance criteria of TREC: counting on negligible documents?
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
The kappa statistic: a second look
Computational Linguistics
Retrieval evaluation with incomplete information
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
A statistical method for system evaluation using incomplete judgments
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating average precision with incomplete and imperfect judgments
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
A comparison of pooled and sampled relevance judgments
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
The CSIRO enterprise search test collection
ACM SIGIR Forum
A simple and efficient sampling method for estimating AP and NDCG
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Inter-coder agreement for computational linguistics
Computational Linguistics
Measuring the agreement among relevance judges
MIRA '99 Proceedings of the 1999 international conference on Final MIRA
Survey and evaluation of query intent detection methods
Proceedings of the 2009 workshop on Web Search Click Data
Towards methods for the collective gathering and quality control of relevance assessments
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Methods for Evaluating Interactive Information Retrieval Systems with Users
Foundations and Trends in Information Retrieval
Weighted Rank Correlation in Information Retrieval Evaluation
AIRS '09 Proceedings of the 5th Asia Information Retrieval Symposium on Information Retrieval Technology
Improving quality of training data for learning to rank using click-through data
Proceedings of the third ACM international conference on Web search and data mining
Automated opinion detection: Implications of the level of agreement between human raters
Information Processing and Management: an International Journal
The effect of assessor error on IR system evaluation
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Extending average precision to graded relevance judgments
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Comparing the sensitivity of information retrieval metrics
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Evaluating search systems using result page context
Proceedings of the third symposium on Information interaction in context
Web search solved?: all result rankings the same?
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
A user study of relevance judgments for e-discovery
Proceedings of the 73rd ASIS&T Annual Meeting on Navigating Streams in an Information Ecosystem - Volume 47
Topic Distillation with Query-Dependent Link Connections and Page Characteristics
ACM Transactions on the Web (TWEB)
Evaluating new search engine configurations with pre-existing judgments and clicks
Proceedings of the 20th international conference on World wide web
In search of quality in crowdsourcing for search engine evaluation
ECIR'11 Proceedings of the 33rd European conference on Advances in information retrieval
Let's agree to disagree: on the evaluation of vocabulary alignment
Proceedings of the sixth international conference on Knowledge capture
Repeatable and reliable search system evaluation using crowdsourcing
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Quantifying test collection quality based on the consistency of relevance judgements
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
The effects of choice in routing relevance judgments
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Selecting optimal training data for learning to rank
Information Processing and Management: an International Journal
A probabilistic method for inferring preferences from clicks
Proceedings of the 20th ACM international conference on Information and knowledge management
A nugget-based test collection construction paradigm
Proceedings of the 20th ACM international conference on Information and knowledge management
IR system evaluation using nugget-based test collections
Proceedings of the fifth ACM international conference on Web search and data mining
A noise-tolerant graphical model for ranking
Information Processing and Management: an International Journal
Using anchor text for homepage and topic distillation search tasks
Journal of the American Society for Information Science and Technology
On aggregating labels from multiple crowd workers to infer relevance of documents
ECIR'12 Proceedings of the 34th European conference on Advances in Information Retrieval
An IR-based evaluation framework for web search query segmentation
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Impact of assessor disagreement on ranking performance
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
On judgments obtained from a commercial search engine
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Using crowdsourcing for TREC relevance assessment
Information Processing and Management: an International Journal
An analysis of systematic judging errors in information retrieval
Proceedings of the 21st ACM international conference on Information and knowledge management
Alternative assessor disagreement and retrieval depth
Proceedings of the 21st ACM international conference on Information and knowledge management
Better than their reputation? on the reliability of relevance assessments with students
CLEF'12 Proceedings of the Third international conference on Information Access Evaluation: multilinguality, multimodality, and visual analytics
Differences in search engine evaluations between query owners and non-owners
Proceedings of the sixth ACM international conference on Web search and data mining
Optimizing nDCG gains by minimizing effect of label inconsistency
ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval
An analysis of human factors and label accuracy in crowdsourcing relevance judgments
Information Retrieval
Identifying top news using crowdsourcing
Information Retrieval
Document features predicting assessor disagreement
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Repeatable and reliable semantic search evaluation
Web Semantics: Science, Services and Agents on the World Wide Web
Choices in batch information retrieval evaluation
Proceedings of the 18th Australasian Document Computing Symposium
Merging algorithms for enterprise search
Proceedings of the 18th Australasian Document Computing Symposium
Exploiting user disagreement for web search evaluation: an experimental approach
Proceedings of the 7th ACM international conference on Web search and data mining
Evaluation in Music Information Retrieval
Journal of Intelligent Information Systems
We investigate to what extent people making relevance judgements for a reusable IR test collection are exchangeable. We consider three classes of judge: "gold standard" judges, who originated the topics and are experts in a particular information seeking task; "silver standard" judges, who are task experts but did not create topics; and "bronze standard" judges, who neither defined topics nor are experts in the task. Analysis shows low levels of agreement in relevance judgements between these three groups. We report on experiments to determine whether this disagreement is sufficient to invalidate the use of a test collection for measuring system performance when relevance assessments have been created by silver standard or bronze standard judges. We find that both system scores and system rankings are subject to consistent but small differences across the three assessment sets. It appears that test collections are not completely robust to changes of judge when the judges vary widely in task and topic expertise. Bronze standard judges may not be able to substitute for topic and task experts, because the relative performance of the assessed systems changes, and gold standard judges are preferred.
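As an illustration of the two quantities the abstract refers to, the sketch below computes Cohen's kappa between two assessors' relevance labels (inter-judge agreement) and Kendall's tau between the system rankings induced by two assessment sets (ranking stability). It is a minimal Python sketch on hypothetical data; the labels, MAP scores, and function names are illustrative assumptions, not figures or code from the paper.

# Minimal sketch: agreement between judges and stability of system rankings.
# All data below is hypothetical and only illustrates the computations.

from collections import Counter
from itertools import combinations

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors' relevance labels on the same documents."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def kendall_tau(scores_x, scores_y):
    """Kendall's tau-a between the system rankings induced by two score lists."""
    pairs = list(combinations(range(len(scores_x)), 2))
    concordant = sum(1 for i, j in pairs
                     if (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j]) > 0)
    discordant = sum(1 for i, j in pairs
                     if (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j]) < 0)
    return (concordant - discordant) / len(pairs)

# Hypothetical binary labels from a gold- and a bronze-standard judge.
gold = [1, 1, 0, 1, 0, 0, 1, 0]
bronze = [1, 0, 0, 1, 0, 1, 1, 0]
print("kappa(gold, bronze) =", round(cohens_kappa(gold, bronze), 3))

# Hypothetical MAP scores for five systems under each judge's assessments.
map_gold = [0.31, 0.28, 0.25, 0.22, 0.19]
map_bronze = [0.27, 0.29, 0.21, 0.23, 0.18]
print("tau(system rankings) =", round(kendall_tau(map_gold, map_bronze), 3))

On these toy inputs the kappa is 0.5 (moderate agreement) and the tau is 0.6, mirroring the abstract's observation that rankings shift somewhat, but not arbitrarily, when judges change.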