The agreement (or disagreement) between relevance judges is an increasingly important issue, for two reasons: new kinds of relevance judgment expression are in use (beyond the classical dichotomous judgment, various studies have introduced scalar and weighted judgments and orders of various kinds), and new media are being introduced (judging the relevance of an image is far quicker than judging a text, so human judgments can be obtained more easily). This paper presents a coherent account of the disagreement between relevance judges and groups of judges. Judgment expressions of different kinds, grouped into two categories, are taken into account. The first category, score judgments, comprises the classical dichotomous, scalar, and weighted judgments. The second, order judgments, comprises total (or linear) and partial (or weak) orders, both with and without equality. A uniform notation for describing relevance judgments of each kind is proposed; some of the problems that arise when one tries to measure the disagreement between judges operationally are described; a measure of the disagreement between two judges expressing judgments of the same kind is proposed; the disagreement within a group of more than two judges is discussed; and, finally, some experimental activity inspired by this study is sketched.
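As a rough illustration of what measuring the disagreement between two judges can look like, the sketch below compares two judges who assign scalar scores to the same documents. It is not the measure proposed in the paper, which the abstract does not specify. It assumes, purely for illustration, that each judgment is a mapping from document identifiers to numeric scores and that disagreement is the fraction of document pairs the two judges order differently (ties count as a third outcome). The names `pairwise_preferences` and `order_disagreement` are hypothetical.

```python
from itertools import combinations


def pairwise_preferences(scores):
    """Map each document pair (a, b), with a < b, to +1, 0, or -1.

    +1 means the judge rates a above b, -1 means below, 0 means a tie.
    `scores` is an assumed representation: {document_id: numeric score}.
    """
    prefs = {}
    for a, b in combinations(sorted(scores), 2):
        prefs[(a, b)] = (scores[a] > scores[b]) - (scores[a] < scores[b])
    return prefs


def order_disagreement(judge1, judge2):
    """Fraction of shared document pairs that the two judges order differently.

    This is an illustrative measure only, not the one defined in the paper.
    """
    p1 = pairwise_preferences(judge1)
    p2 = pairwise_preferences(judge2)
    shared = p1.keys() & p2.keys()
    if not shared:
        raise ValueError("the judges have no document pair in common")
    return sum(p1[pair] != p2[pair] for pair in shared) / len(shared)


# Two judges giving scalar judgments on the same three documents.
judge_a = {"d1": 3, "d2": 1, "d3": 2}
judge_b = {"d1": 2, "d2": 2, "d3": 1}

print(order_disagreement(judge_a, judge_b))  # 2 of 3 pairs differ
```

Dichotomous judgments fit the same representation with scores restricted to 0 and 1. Partial orders with genuinely incomparable documents do not reduce to a single score per document and would need an explicit pair-based representation instead.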