A comparison of the optimality of statistical significance tests for information retrieval evaluation

  • Authors:
  • Julián Urbano; Mónica Marrero; Diego Martín

  • Affiliations:
  • University Carlos III of Madrid, Leganés, Spain; University Carlos III of Madrid, Leganés, Spain; Technical University of Madrid, Madrid, Spain

  • Venue:
  • Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2013

Abstract

Previous research has suggested the permutation test as the theoretically optimal statistical significance test for IR evaluation, and advocated for the discontinuation of the Wilcoxon and sign tests. We present a large-scale study, comprising nearly 60 million system comparisons, showing that in practice the bootstrap, t-test and Wilcoxon test outperform the permutation test under different optimality criteria. We also show that actual error rates seem to be lower than the theoretically expected 5%, further confirming that we may actually be underestimating significance.
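
For readers unfamiliar with the tests named in the abstract, the following is a minimal sketch of two of them applied to paired data. It assumes the input is a list of per-topic score differences between two systems, and it is not the authors' experimental procedure. The function names `paired_permutation_p` and `paired_t_statistic`, the trial count, and the seed are hypothetical choices made for illustration.

```python
import math
import random
import statistics


def paired_permutation_p(diffs, trials=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test, Monte Carlo approximation.

    diffs: per-topic score differences between two systems (assumed input).
    Under the null hypothesis the sign of each difference is arbitrary, so we
    flip signs at random and count how often the absolute mean difference is
    at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (trials + 1)


def paired_t_statistic(diffs):
    """Paired t statistic: mean difference divided by its standard error."""
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
```

A p-value below 0.05 from `paired_permutation_p` would be read as significant at the 5% level the abstract refers to. Turning the t statistic into a p-value additionally requires the Student t distribution with `len(diffs) - 1` degrees of freedom, which is left out of this sketch.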