Measuring the agreement among relevance judges

Authors:
Stefano Mizzaro
Affiliations:
Department of Mathematics and Computer Science, University of Udine, Udine, Italy
Venue:
MIRA'99 Proceedings of the 1999 international conference on Final Mira
Year:
1999

Abstract

The agreement (or disagreement) between relevance judges is becoming a more important issue for two reasons. First, new kinds of relevance judgment expression are in use: to the classical dichotomous judgment, various studies have added scalar and weighted judgments and orders of various kinds. Second, new media are being introduced: judging the relevance of an image is far quicker than judging that of a text, so human judgments can be obtained more easily. This paper presents a coherent account of the disagreement between relevance judges and groups of judges. Judgment expressions of different kinds, grouped into two categories, are taken into account. The first category, score judgments, comprises the more classical dichotomous, scalar, and weighted judgments. The second category, order judgments, comprises total (or linear) and partial (or weak) orders, each with or without equality. The paper proposes a uniform notation for describing relevance judgments of each kind; describes some of the problems that arise when one tries to measure the disagreement between judges operationally; proposes a measure for the disagreement of two judges expressing two judgments of the same kind; discusses the disagreement of a group of more than two judges; and, finally, sketches some experimental activity inspired by this study.
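
The abstract does not state the proposed measure, so the following is only an illustrative sketch of what a disagreement measure for two judgments of the same kind could look like. It is not the paper's measure. It assumes score judgments are stored as hypothetical `doc -> score` mappings and total orders (with or without equality) as `doc -> rank` mappings. The function names, the mean-absolute-difference formula, and the pairwise-discordance formula are all assumptions chosen for simplicity; partial orders with incomparable documents are not handled.

```python
from itertools import combinations


def score_disagreement(a, b, max_score=1.0):
    """Mean absolute score difference over shared documents, scaled to [0, 1].

    Illustrative only: `a` and `b` map document ids to scores in
    [0, max_score] (dichotomous judgments use 0 and 1).
    """
    docs = a.keys() & b.keys()
    if not docs:
        raise ValueError("no documents judged by both judges")
    return sum(abs(a[d] - b[d]) for d in docs) / (len(docs) * max_score)


def order_disagreement(a, b):
    """Fraction of document pairs that the two orders rank differently.

    Illustrative only: `a` and `b` map document ids to rank positions;
    equal positions express equality between two documents.
    """
    docs = sorted(a.keys() & b.keys())
    pairs = list(combinations(docs, 2))
    if not pairs:
        raise ValueError("need at least two documents judged by both judges")

    def sign(x):
        return (x > 0) - (x < 0)

    discordant = sum(
        1 for d, e in pairs if sign(a[d] - a[e]) != sign(b[d] - b[e])
    )
    return discordant / len(pairs)


# Two dichotomous judgments that differ on two of four documents.
judge1 = {"d1": 1, "d2": 0, "d3": 1, "d4": 0}
judge2 = {"d1": 1, "d2": 1, "d3": 0, "d4": 0}
score_result = score_disagreement(judge1, judge2)

# Two total orders that swap the last two documents.
rank1 = {"d1": 1, "d2": 2, "d3": 3}
rank2 = {"d1": 1, "d2": 3, "d3": 2}
order_result = order_disagreement(rank1, rank2)
```

Under these assumptions, one simple way to extend the sketch to a group of more than two judges would be to average the pairwise values, although the paper may treat the group case differently.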