Agreement among statistical significance tests for information retrieval evaluation at varying sample sizes

  • Authors:
  • Mark D. Smucker; James Allan; Ben Carterette

  • Affiliations:
  • University of Waterloo, Waterloo, ON, Canada; University of Massachusetts Amherst, Amherst, MA, USA; University of Delaware, Newark, DE, USA

  • Venue:
  • Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2009

Abstract

Research has shown that little practical difference exists between the randomization, Student's paired t, and bootstrap tests of statistical significance for TREC ad-hoc retrieval experiments with 50 topics. We compared these three tests on runs with as few as 10 topics. We found that the tests disagree increasingly as the number of topics decreases. At smaller numbers of topics, the randomization test tended to produce smaller p-values than the t-test when p-values were below 0.1. The bootstrap exhibited a systematic bias towards p-values lower than those of the t-test, and this bias increased as the number of topics decreased. We recommend the use of the randomization test, although the t-test appears to be suitable even when the number of topics is small.
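
For readers who want to see how the three tests named in the abstract can be compared on one pair of runs, the following is a minimal sketch using only the Python standard library. It is not the authors' code. It assumes a two-sided test on the mean per-topic score difference, a sign-flipping randomization test, and a "shift method" bootstrap; the abstract does not specify these details. The function names and the per-topic differences in the demo are hypothetical and made up for illustration.

```python
import math
import random
import statistics


def paired_t_test(diffs):
    """Two-sided Student's paired t-test p-value on per-topic differences.

    Assumes the differences are not all identical (non-zero variance).
    """
    n = len(diffs)
    df = n - 1
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    # Student's t density with df degrees of freedom.
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    # Integrate the density over [0, |t|] with Simpson's rule.
    steps = 2000  # must be even
    h = abs(t) / steps
    area = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3
    return max(0.0, 1 - 2 * area)


def randomization_test(diffs, trials=10000, seed=0):
    """Two-sided sign-flipping randomization test (Monte Carlo estimate)."""
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis each difference is equally likely
        # to have either sign.
        s = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(s) >= observed:
            hits += 1
    return hits / trials


def bootstrap_shift_test(diffs, trials=10000, seed=0):
    """Two-sided bootstrap test using the shift method (an assumed variant)."""
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    # Resample topics with replacement and record each resample's mean.
    means = [sum(rng.choices(diffs, k=n)) / n for _ in range(trials)]
    # Shift the bootstrap distribution so that it is centered on zero.
    center = sum(means) / trials
    hits = sum(1 for m in means if abs(m - center) >= observed)
    return hits / trials


if __name__ == "__main__":
    # Hypothetical per-topic differences in average precision for 10 topics.
    diffs = [0.12, -0.03, 0.08, 0.01, 0.15, -0.06, 0.04, 0.09, -0.01, 0.07]
    print("t-test        p =", round(paired_t_test(diffs), 4))
    print("randomization p =", randomization_test(diffs))
    print("bootstrap     p =", bootstrap_shift_test(diffs))
```

Running the three functions on the same differences while varying the number of topics is one way to observe the kind of disagreement the abstract describes. The two resampling p-values are Monte Carlo estimates and vary with `trials` and `seed`.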