Blind Men and Elephants: Six Approaches to TREC Data. Information Retrieval.
On Collection Size and Retrieval Effectiveness. Information Retrieval.
On Ranking the Effectiveness of Searches. SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Evaluating Evaluation Metrics Based on the Bootstrap. SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Statistical Precision of Information Retrieval Evaluation. SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Modeling the Score Distributions of Relevant and Non-relevant Documents. ICTIR '09: Proceedings of the 2nd International Conference on the Theory of Information Retrieval: Advances in Information Retrieval Theory.
Simulating Simple User Behavior for System Effectiveness Evaluation. Proceedings of the 20th ACM International Conference on Information and Knowledge Management.
On Smoothing Average Precision. ECIR '12: Proceedings of the 34th European Conference on Advances in Information Retrieval.
On the Measurement of Test Collection Reliability. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval.
Bias-Variance Decomposition of IR Evaluation. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval.
Modelling Score Distributions Without Actual Scores. Proceedings of the 2013 Conference on the Theory of Information Retrieval.
Bias-Variance Analysis in Estimating True Query Model for Information Retrieval. Information Processing and Management: An International Journal.
We explore the notion, put forward by Cormack & Lynam and by Robertson, that a document collection used for Cranfield-style experiments should be considered a sample from some larger population of documents. In this view, any per-topic metric (such as average precision) is an estimate of that metric's true value for that topic over the full population, and therefore carries its own per-topic variance, estimation precision, or noise. As in the two papers mentioned, we explore this notion by simulating other samples from the same large population, and we investigate different ways of performing this simulation. One use of this analysis is to refine the notion of statistical significance of a difference between two systems (most such analyses treat each per-topic measurement as equally precise). We propose a mixed-effects-model method for measuring significance and compare it experimentally with the traditional t-test.
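The idea of treating the collection as a sample can be illustrated with a minimal sketch: repeatedly subsample the document collection, recompute average precision on each induced ranking, and take the spread of the resulting values as an estimate of the per-topic noise. The ranking, the relevant set, the 50% keep rate, and the simplified AP (normalized by relevant documents retrieved rather than judged) are all assumptions for illustration, not the paper's actual simulation design.

```python
# Hypothetical sketch: per-topic variance of average precision (AP)
# estimated by simulating smaller collections drawn from the same
# population via random document subsampling. All names, sizes, and
# the keep rate are illustrative assumptions.
import random
from statistics import mean, pstdev

def average_precision(ranking, relevant):
    """Simplified AP: mean precision at the ranks of retrieved
    relevant documents (0.0 if none are retrieved)."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return mean(precisions) if precisions else 0.0

def simulate_ap_samples(ranking, relevant, keep=0.5, n_samples=1000, seed=0):
    """Simulate collections sampled from a larger population by keeping
    each document with probability `keep`; return one AP per sample."""
    rng = random.Random(seed)
    aps = []
    for _ in range(n_samples):
        subranking = [d for d in ranking if rng.random() < keep]
        aps.append(average_precision(subranking, relevant))
    return aps

if __name__ == "__main__":
    # Toy topic: 20 retrieved documents, 4 of them relevant.
    ranking = [f"d{i}" for i in range(1, 21)]
    relevant = {"d1", "d4", "d9", "d15"}
    aps = simulate_ap_samples(ranking, relevant)
    print("AP on full collection:", round(average_precision(ranking, relevant), 4))
    print("mean AP over samples:", round(mean(aps), 4),
          "std (per-topic noise):", round(pstdev(aps), 4))
```

The nonzero standard deviation across simulated samples is the per-topic estimation noise the abstract refers to; a significance test that weights topics by this noise (e.g. via a mixed-effects model) can then down-weight topics whose AP estimates are least stable.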