Information retrieval experimentation generally proceeds in a cycle of development, evaluation, and hypothesis testing. Ideally, the evaluation and testing phases should be short and easy, so as to maximize the time spent in development. Recent work has reduced the amount of assessor effort needed to evaluate retrieval systems, but it has, for the most part, not investigated how these methods affect tests of statistical significance. In this work, we explore in detail the effect of reduced judgment sets on the sign test. We demonstrate, both analytically and empirically, the relationship between the power of the test, the number of topics evaluated, and the number of judgments available. Using these relationships, we can determine the number of topics and judgments needed for the least-cost but highest-confidence significance evaluation. Specifically, testing pairwise significance over 192 topics with fewer than 5 judgments per topic is as good as testing significance over 25 topics with an average of 166 judgments per topic: 85% less effort with no additional errors.
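To make the power/topic-count relationship concrete, here is a minimal sketch (not the authors' code) of how the power of a one-sided sign test grows with the number of topics. The effect size p_alt, the probability that a single topic favours the better system, is a hypothetical parameter chosen for illustration; ties are assumed to have been discarded, as in the simple sign test.

from scipy.stats import binom

def sign_test_power(n_topics, p_alt, alpha=0.05):
    """Power of a one-sided sign test over n_topics paired comparisons.

    Under H0 each topic is equally likely to favour either system (p = 0.5).
    We reject H0 at the smallest k with P(X >= k | n, p=0.5) <= alpha; the
    power is the probability of reaching that k under the alternative p_alt.
    """
    # Find the smallest rejection threshold k. Note binom.sf(k - 1, n, p)
    # equals P(X >= k) for the binomial upper tail.
    for k in range(n_topics + 1):
        if binom.sf(k - 1, n_topics, 0.5) <= alpha:
            break
    else:
        return 0.0  # no achievable rejection region at this alpha
    # Power: probability of observing >= k topic wins under the alternative.
    return binom.sf(k - 1, n_topics, p_alt)

if __name__ == "__main__":
    # Compare the two designs discussed in the abstract: 25 topics vs. 192.
    for n in (25, 50, 100, 192):
        print(n, round(sign_test_power(n, p_alt=0.6), 3))

With an assumed per-topic win probability of 0.6, the sketch shows power rising steeply as topics are added, which is the intuition behind trading many judgments per topic for many lightly judged topics.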