Validity and power of t-test for comparing MAP and GMAP

  • Authors:
  • Gordon V. Cormack;Thomas R. Lynam

  • Affiliations:
  • University of Waterloo, Waterloo, ON, Canada;University of Waterloo, Waterloo, ON, Canada

  • Venue:
  • SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

We examine the validity and power of the t-test, Wilcoxon test, and sign test in determining whether or not the difference in performance between two IR systems is significant. Empirical tests conducted on subsets of the TREC2004 Robust Retrieval collection indicate that the p-values computed by these tests for the difference in mean average precision (MAP) between two systems are very accurate fora wide range of sample sizes and significance estimates. Similarly, these tests have good power, with the t-test proving superior overall. The t-test is also valid for comparing geometric mean average precision (GMAP), exhibiting slightly superior accuracy and slightly inferior power than for MAPcomparison.