We examine the validity and power of the t-test, Wilcoxon test, and sign test in determining whether the difference in performance between two IR systems is significant. Empirical tests conducted on subsets of the TREC 2004 Robust Retrieval collection indicate that the p-values computed by these tests for the difference in mean average precision (MAP) between two systems are very accurate for a wide range of sample sizes and significance estimates. Similarly, these tests have good power, with the t-test proving superior overall. The t-test is also valid for comparing geometric mean average precision (GMAP), exhibiting slightly higher accuracy and slightly lower power than it does for MAP comparison.
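To make the tests named above concrete, the following is a minimal sketch of how two of them might be applied to per-topic average precision (AP) scores from two systems. It is not the procedure used in the paper. The function names, the sample scores, and the floor `eps` applied before taking logarithms are all assumptions made for illustration. Comparing GMAP is sketched here as a paired t-test on log-transformed AP values, which is one natural reading of the abstract rather than a confirmed detail. The Wilcoxon test is omitted for brevity.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Paired t statistic for per-topic differences a[i] - b[i] (df = n - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


def sign_test_p_value(a, b):
    """Exact two-sided sign test p-value; tied topics are dropped."""
    wins = sum(x > y for x, y in zip(a, b))
    losses = sum(x < y for x, y in zip(a, b))
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Binomial(n, 0.5) probability of seeing k or fewer of the rarer outcome.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def log_ap(scores, eps=1e-5):
    """Log-transform AP scores; eps is an assumed floor that avoids log(0)."""
    return [math.log(max(s, eps)) for s in scores]


def gmap(scores, eps=1e-5):
    """Geometric mean of AP scores, computed via the mean of their logs."""
    return math.exp(mean(log_ap(scores, eps)))


# Hypothetical per-topic AP scores for two systems (illustrative only).
ap_a = [0.42, 0.31, 0.55, 0.10, 0.67, 0.25, 0.48, 0.39]
ap_b = [0.38, 0.29, 0.50, 0.12, 0.60, 0.20, 0.45, 0.33]

t_map = paired_t_statistic(ap_a, ap_b)                   # MAP comparison
t_gmap = paired_t_statistic(log_ap(ap_a), log_ap(ap_b))  # GMAP comparison
p_sign = sign_test_p_value(ap_a, ap_b)

print(f"MAP:  {mean(ap_a):.4f} vs {mean(ap_b):.4f}, t = {t_map:.3f}")
print(f"GMAP: {gmap(ap_a):.4f} vs {gmap(ap_b):.4f}, t = {t_gmap:.3f}")
print(f"sign test p-value = {p_sign:.4f}")
```

The t statistics would be referred to a t distribution with n - 1 degrees of freedom to obtain p-values. That step is left out because the Python standard library has no t-distribution CDF; a statistics package would normally supply it.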