The power of a statistical test is the probability that it detects a given true effect; a power analysis specifies the sample size required to do so reliably. In IR evaluation, this sample size corresponds to the number of topics likely to be sufficient to detect a certain degree of superiority of one system over another. To predict the power of a test, one must estimate the variability of the population being sampled from; here, that of between-system score deltas. This paper demonstrates that basing such an estimate either on previous experience or on trial experiments leaves wide margins of error. Iteratively adding more topics to the test set until power is achieved is more efficient; however, we show that it biases the experiment towards finding both power and significance. A hybrid methodology is proposed, and the reporting requirements for experimenters using it are laid out. We also demonstrate that, for the same relevance assessment effort, greater statistical power is achieved by evaluating a large number of topics shallowly rather than a small number deeply.
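The prospective power analysis described in the opening sentences can be sketched with the usual normal approximation to the paired t-test. In the snippet below, the function name, significance level, and the example effect sizes and delta standard deviations are illustrative assumptions, not values from the paper:

```python
# A minimal sketch (not the paper's method) of prospective power analysis for
# a paired IR comparison: given an estimated standard deviation of per-topic
# between-system score deltas, how many topics are needed to detect a mean
# delta of a given size?  Uses the normal approximation to the paired t-test.
from scipy.stats import norm

def topics_needed(delta, sigma_delta, alpha=0.05, power=0.80):
    """Approximate topic count for a two-sided paired test.

    delta       -- true mean per-topic score difference to detect (e.g. 0.02 MAP)
    sigma_delta -- estimated std. dev. of per-topic score deltas (the hard part)
    """
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the test
    z_beta = norm.ppf(power)            # quantile for the desired power
    n = ((z_alpha + z_beta) * sigma_delta / delta) ** 2
    return int(n) + 1                   # round up to a whole topic

# Illustrative numbers: detecting a 0.02 MAP difference when deltas have
# standard deviation 0.08 needs roughly 126 topics; if the true standard
# deviation is 0.12, roughly 283.
print(topics_needed(0.02, 0.08), topics_needed(0.02, 0.12))
```

The jump from roughly 126 to 283 topics under a modest change in the assumed variability shows how sensitive the answer is to the estimate of the delta standard deviation, which is exactly the quantity the paper argues cannot be pinned down from previous experience or trial experiments without wide margins of error.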
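The bias attributed to the add-topics-until-significant strategy is the familiar optional-stopping effect. A toy Monte Carlo (my own illustration with arbitrary parameters, not the paper's experiment) makes it concrete: when two systems are in truth identical, a test run once at a fixed topic count rejects about 5% of the time, while testing after every added topic and stopping at the first p < 0.05 rejects far more often.

```python
# Toy simulation of optional-stopping bias: with no true difference between
# systems, compare the false-positive rate of a single fixed-n t-test against
# a sequential test that re-checks after each added topic.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
TRIALS, MIN_N, MAX_N = 2000, 10, 100   # arbitrary illustrative settings

fixed = sequential = 0
for _ in range(TRIALS):
    deltas = rng.normal(0.0, 0.08, MAX_N)    # per-topic deltas, true mean 0
    if ttest_1samp(deltas, 0.0).pvalue < 0.05:
        fixed += 1                           # test once, at the full topic count
    for n in range(MIN_N, MAX_N + 1):
        if ttest_1samp(deltas[:n], 0.0).pvalue < 0.05:
            sequential += 1                  # stop as soon as p dips below 0.05
            break

print(f"fixed-n false positives:    {fixed / TRIALS:.3f}")      # near 0.05
print(f"sequential false positives: {sequential / TRIALS:.3f}") # well above 0.05
```

This is the inflation the abstract refers to as a bias in favour of finding both power and significance, and it motivates the hybrid methodology and the accompanying reporting requirements.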