On per-topic variance in IR evaluation

  • Authors:
  • Stephen E. Robertson; Evangelos Kanoulas

  • Affiliations:
  • Microsoft Research, Cambridge, United Kingdom; University of Sheffield, Sheffield, United Kingdom

  • Venue:
  • SIGIR '12: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval

  • Year:
  • 2012

Abstract

We explore the notion, put forward by Cormack & Lynam and by Robertson, that a document collection used for Cranfield-style experiments should be regarded as a sample from some larger population of documents. In this view, any per-topic metric (such as average precision) is an estimate of that metric's true value for that topic in the full population, and therefore carries its own per-topic variance, estimation error, or noise. As in the two papers cited, we explore this notion by simulating other samples from the same large population, and we investigate different ways of performing this simulation. One use of this analysis is to refine the notion of statistical significance of a difference between two systems (most such analyses treat each per-topic measurement as equally precise). We propose a significance-testing method based on a mixed-effects model and compare it experimentally with the traditional t-test.
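
To make the sampling idea concrete, the sketch below (Python; not the authors' code, which the abstract does not include) simulates alternative document samples for one topic with a simple bootstrap over the ranked, judged documents and reports the resulting per-topic variance of average precision. The paper investigates several simulation schemes; the bootstrap here, and all function and variable names, are illustrative assumptions.

```python
"""Sketch: per-topic variance of average precision (AP) under resampling.

The judged collection is treated as a sample from a larger population of
documents; alternative samples are simulated with a bootstrap over the
ranked, judged documents (one of several possible simulation schemes).
"""
import random


def average_precision(relevance):
    """AP for a ranked list of binary relevance judgements (1 = relevant)."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def bootstrap_ap_variance(relevance, n_samples=1000, seed=0):
    """Resample documents with replacement (keeping the system's ranking
    order) to simulate other samples from the population; return the mean
    and variance of AP across the simulated samples."""
    rng = random.Random(seed)
    n = len(relevance)
    aps = []
    for _ in range(n_samples):
        sampled_ranks = sorted(rng.randrange(n) for _ in range(n))
        aps.append(average_precision([relevance[i] for i in sampled_ranks]))
    mean = sum(aps) / n_samples
    var = sum((a - mean) ** 2 for a in aps) / (n_samples - 1)
    return mean, var


# Example topic: relevant documents concentrated near the top of the ranking.
ranked_relevance = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
mean_ap, var_ap = bootstrap_ap_variance(ranked_relevance)
print(f"point-estimate AP: {average_precision(ranked_relevance):.3f}")
print(f"simulated mean AP: {mean_ap:.3f}, per-topic variance: {var_ap:.4f}")
```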
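
The closing comparison can be sketched the same way: below, a traditional paired t-test (which treats every per-topic measurement as equally precise) is run alongside a mixed-effects model with system as a fixed effect and topic as a random intercept, on synthetic per-topic AP scores. This plain random-intercept model is an assumed stand-in for the paper's method, which additionally accounts for the per-topic estimate variance; the data and names are invented for illustration.

```python
"""Sketch: mixed-effects significance test vs. the traditional paired t-test.

Synthetic per-topic AP scores for two systems, A and B, share a common
per-topic difficulty effect; system B is given a small true advantage.
"""
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
n_topics = 50
difficulty = rng.normal(0.4, 0.15, n_topics)  # shared per-topic difficulty
ap_a = np.clip(difficulty + rng.normal(0.0, 0.05, n_topics), 0.0, 1.0)
ap_b = np.clip(difficulty + 0.03 + rng.normal(0.0, 0.05, n_topics), 0.0, 1.0)

# Traditional paired t-test over per-topic AP differences.
t_stat, t_p = stats.ttest_rel(ap_a, ap_b)
print(f"paired t-test: t = {t_stat:.3f}, p = {t_p:.4f}")

# Mixed-effects alternative: system as fixed effect, topic as random intercept.
scores = pd.DataFrame({
    "ap": np.concatenate([ap_a, ap_b]),
    "system": ["A"] * n_topics + ["B"] * n_topics,
    "topic": list(range(n_topics)) * 2,
})
result = smf.mixedlm("ap ~ system", scores, groups=scores["topic"]).fit()
print(result.summary())  # the 'system[T.B]' row carries the significance test
```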