Choices in batch information retrieval evaluation

  • Authors: Falk Scholer; Alistair Moffat; Paul Thomas
  • Affiliations: RMIT University, Australia; The University of Melbourne, Australia; CSIRO & ANU, Australia
  • Venue: Proceedings of the 18th Australasian Document Computing Symposium
  • Year: 2013

Abstract

Web search tools are used on a daily basis by billions of people. The commercial providers of these services spend large amounts of money measuring their own effectiveness and benchmarking against their competitors; nothing less than their corporate survival is at stake. Techniques for offline or "batch" evaluation of search quality have received considerable attention, spanning ways of constructing relevance judgments; ways of using them to generate numeric scores; and ways of inferring system "superiority" from sets of such scores. Our purpose in this paper is to consider these mechanisms as a chain of inter-dependent activities, in order to explore some of the ramifications of alternative components. By disaggregating the different activities, and asking what the ultimate objective of the measurement process is, we provide new insights into evaluation approaches, and are able to suggest new combinations that might prove fruitful avenues for exploration. Our observations are examined with reference to data collected from a user study in which 34 users each undertook six search tasks, using two systems of markedly different quality. We hope to foster broader awareness of the many factors that go into an evaluation of search effectiveness, and of the implications of these choices, and to encourage researchers to report all aspects of the evaluation process when describing their system performance experiments.
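
To make the chain of activities concrete, the sketch below (not taken from the paper) walks through the three stages the abstract names: relevance judgments are used to produce a per-topic metric score for each system, and the paired scores are then used to infer "superiority". The metric (average precision), the toy judgments and runs, and the choice of a paired t-test are all illustrative assumptions, not the authors' method.

```python
# Minimal sketch of a batch-evaluation chain, under assumed choices:
# judgments -> per-topic average precision -> paired significance test.
from scipy import stats

def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant docs."""
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

# Hypothetical relevance judgments and ranked runs for two systems.
qrels = {
    "t1": {"d1", "d3"},
    "t2": {"d2"},
    "t3": {"d1", "d2", "d4"},
}
run_a = {"t1": ["d1", "d2", "d3"], "t2": ["d2", "d1"], "t3": ["d4", "d1", "d3"]}
run_b = {"t1": ["d2", "d1", "d3"], "t2": ["d1", "d2"], "t3": ["d3", "d4", "d1"]}

scores_a = [average_precision(run_a[t], qrels[t]) for t in qrels]
scores_b = [average_precision(run_b[t], qrels[t]) for t in qrels]

# Infer system "superiority" from the paired per-topic scores.
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"mean AP: A={sum(scores_a)/len(scores_a):.3f}  B={sum(scores_b)/len(scores_b):.3f}")
print(f"paired t-test: t={t_stat:.3f}, p={p_value:.3f}")
```

Swapping any link in this chain (a different judgment pool, a different metric, or a different inference procedure) can change the conclusion, which is the sense in which the paper treats these choices as inter-dependent.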