Over the last three years we conducted several information retrieval evaluation series with more than 180 LIS students, who made relevance assessments on the outputs of three specific retrieval services. In this study we focus not on the retrieval performance of our system but on the relevance assessments themselves and on inter-assessor reliability. To quantify agreement we apply Fleiss' Kappa and Krippendorff's Alpha. Comparing the two statistical measures, Kappa values averaged 0.37 and Alpha values 0.15. We then use the two agreement measures to drop overly unreliable assessments from our data set. Computing the differences between the unfiltered and the filtered data sets yields root mean square errors between 0.02 and 0.12. We take this as a clear indicator that assessor disagreement affects the reliability of retrieval evaluations, and we suggest either not working with unfiltered results or clearly documenting the disagreement rates.
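The abstract does not show the underlying computation, so here is a minimal Python sketch of the two agreement measures it names, for nominal relevance judgments with no missing values. The function names and the toy `ratings` matrix are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' Kappa for a (n_items x n_categories) matrix of
    category counts; assumes every item is rated by the same
    number of assessors."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts.sum(axis=1)[0]
    # Per-item agreement: fraction of assessor pairs that agree.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

def krippendorff_alpha_nominal(ratings):
    """Krippendorff's Alpha for nominal data without missing values;
    `ratings` is (n_items x n_raters)."""
    ratings = np.asarray(ratings)
    categories = np.unique(ratings)
    _, n_raters = ratings.shape
    # Coincidence matrix: value pairs co-occurring within one item.
    o = np.zeros((len(categories), len(categories)))
    for row in ratings:
        counts = np.array([(row == c).sum() for c in categories], dtype=float)
        pairs = np.outer(counts, counts) - np.diag(counts)
        o += pairs / (n_raters - 1)
    n_c = o.sum(axis=1)
    n_total = n_c.sum()
    # Nominal metric: every disagreement weighs equally.
    d_o = o.sum() - np.trace(o)                                  # observed disagreement
    d_e = (np.outer(n_c, n_c).sum() - np.square(n_c).sum()) / (n_total - 1)  # expected
    return 1 - d_o / d_e

# Hypothetical toy data: 4 documents, 3 assessors, binary relevance (0/1).
ratings = np.array([[1, 1, 1],
                    [1, 1, 0],
                    [0, 0, 0],
                    [1, 0, 0]])
counts = np.array([[(row == 0).sum(), (row == 1).sum()] for row in ratings])
print(fleiss_kappa(counts))               # ~0.33 on this toy data
print(krippendorff_alpha_nominal(ratings))  # ~0.39 on this toy data
```

Both statistics correct observed agreement for chance, but they estimate chance differently (Kappa from rater marginals, Alpha from pooled value frequencies with a small-sample correction), which is one reason the two can diverge on the same judgments, as in the 0.37 versus 0.15 averages reported above.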