Validity and power of t-test for comparing MAP and GMAP

Authors:
Gordon V. Cormack;Thomas R. Lynam
Affiliations:
University of Waterloo, Waterloo, ON, Canada;University of Waterloo, Waterloo, ON, Canada
Venue:
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2007

Abstract

We examine the validity and power of the t-test, Wilcoxon test, and sign test in determining whether the difference in performance between two IR systems is significant. Empirical tests conducted on subsets of the TREC 2004 Robust Retrieval collection indicate that the p-values computed by these tests for the difference in mean average precision (MAP) between two systems are very accurate for a wide range of sample sizes and significance estimates. Similarly, these tests have good power, with the t-test proving superior overall. The t-test is also valid for comparing geometric mean average precision (GMAP), exhibiting slightly superior accuracy and slightly inferior power compared with its use for MAP comparison.
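
As a rough illustration of the comparisons the abstract describes, the sketch below computes a paired t statistic on per-topic average precision (AP) differences, the same statistic on log-transformed AP (which compares geometric rather than arithmetic means), and an exact two-sided sign test. It is a minimal sketch and not the authors' experimental code: the function names, the floor `eps` applied before taking logs, and the sample scores are all assumptions made for illustration. Converting the t statistic into a p-value needs the t distribution with n - 1 degrees of freedom, which the standard library does not provide, so that step is left out.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Paired t statistic for the per-topic differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # stdev() is the sample standard deviation (n - 1 denominator).
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


def gmap_t_statistic(a, b, eps=1e-5):
    """Same statistic on log AP, i.e. a comparison of geometric means.

    eps is an assumed floor that keeps log() defined when AP is 0.
    """
    log_a = [math.log(max(x, eps)) for x in a]
    log_b = [math.log(max(y, eps)) for y in b]
    return paired_t_statistic(log_a, log_b)


def sign_test_p(a, b):
    """Exact two-sided sign test p-value; tied topics are dropped."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-topic AP scores for two hypothetical systems.
ap_a = [0.42, 0.10, 0.65, 0.30, 0.05, 0.51]
ap_b = [0.40, 0.02, 0.60, 0.33, 0.01, 0.45]

t_map = paired_t_statistic(ap_a, ap_b)
t_gmap = gmap_t_statistic(ap_a, ap_b)
p_sign = sign_test_p(ap_a, ap_b)
```

The log transform gives more weight to differences on low-scoring topics, so the MAP and GMAP statistics can differ noticeably on the same data.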