A comparison of statistical significance tests for information retrieval evaluation
Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM '07)
Research has shown that little practical difference exists between the randomization, Student's paired t, and bootstrap tests of statistical significance for TREC ad-hoc retrieval experiments with 50 topics. We compared these three tests on runs with as few as 10 topics. We found that the tests show increasing disagreement as the number of topics decreases. At smaller numbers of topics, the randomization test tended to produce smaller p-values than the t-test for p-values below 0.1. The bootstrap test exhibited a systematic bias towards p-values smaller than those of the t-test, and this bias increased as the number of topics decreased. We recommend the use of the randomization test, although the t-test appears to be suitable even when the number of topics is small.
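To make the comparison concrete, the following is a minimal sketch of how the three paired tests might be computed on per-topic score differences between two retrieval runs. It is an illustration only, not the authors' implementation: the function name `paired_tests`, the resampling count, and the synthetic scores are assumptions. The randomization test is realized here as a sign-flip permutation of the paired differences, and the bootstrap test uses the common shift method that centres the differences under the null hypothesis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def paired_tests(scores_a, scores_b, n_resamples=10000):
    """Two-sided p-values from the paired t-test, the paired randomization
    (sign-flip permutation) test, and the paired bootstrap-shift test,
    all applied to per-topic score differences between two runs."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    diffs = scores_a - scores_b
    n = len(diffs)
    observed = diffs.mean()

    # Student's paired t-test on the per-topic differences.
    t_p = stats.ttest_rel(scores_a, scores_b).pvalue

    # Randomization test: randomly flip the sign of each per-topic
    # difference, i.e. swap the two systems' scores on that topic.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, n))
    rand_means = (signs * diffs).mean(axis=1)
    rand_p = np.mean(np.abs(rand_means) >= abs(observed))

    # Bootstrap-shift test: resample topics with replacement from the
    # differences shifted to have zero mean under the null hypothesis.
    centred = diffs - observed
    idx = rng.integers(0, n, size=(n_resamples, n))
    boot_means = centred[idx].mean(axis=1)
    boot_p = np.mean(np.abs(boot_means) >= abs(observed))

    return t_p, rand_p, boot_p

# Hypothetical per-topic average precision scores for a 10-topic experiment.
sys_a = rng.uniform(0.2, 0.6, size=10)
sys_b = sys_a + rng.normal(0.02, 0.05, size=10)
print(paired_tests(sys_a, sys_b))
```

With only 10 topics the three p-values returned by a sketch like this can diverge noticeably, whereas with 50 topics they are typically very close, which is the pattern the paper examines.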