The standard "Cranfield" approach to the evaluation of information retrieval systems has been used and refined for nearly fifty years, and has been a key element in the development of large-scale retrieval systems. The resources created by such systematic evaluations have enabled thorough retrospective investigation of the strengths and limitations of particular variants of this evaluation approach; over the last few years, such investigation has for example led to identification of serious flaws in some experiments. Knowledge of these flaws can prevent their perpetuation into future work and informs the design of new experiments and infrastructures. In this position statement we briefly review some aspects of evaluation and, based on our research and observations over the last decade, outline some principles on which we believe new infrastructure should rest.