Information retrieval experimentation generally proceeds in a cycle of development, evaluation, and hypothesis testing. Ideally, the evaluation and testing phases should be short and easy, so as to maximize the time spent in development. Recent work has reduced the amount of assessor effort needed to evaluate retrieval systems, but it has, for the most part, not investigated how these methods affect tests of statistical significance. In this work, we explore in detail the effect of reduced judgment sets on the sign test. We demonstrate, both analytically and empirically, the relationship between the power of the test, the number of topics evaluated, and the number of judgments available. Using these relationships, we can determine the number of topics and judgments needed for the least-cost but highest-confidence significance evaluation. Specifically, testing pairwise significance over 192 topics with fewer than 5 judgments per topic is as good as testing significance over 25 topics with an average of 166 judgments per topic: 85% less effort with no additional errors.
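To make the power/topic-count relationship concrete, here is a minimal sketch (not the authors' code) of how the power of a one-sided sign test grows with the number of topics. The effect size p_alt, the probability that a single topic favours the better system, is a hypothetical parameter chosen for illustration; ties are assumed to have been discarded, as in the simple sign test.

from scipy.stats import binom

def sign_test_power(n_topics, p_alt, alpha=0.05):
    """Power of a one-sided sign test over n_topics paired comparisons.

    Under H0 each topic is equally likely to favour either system (p = 0.5).
    We reject H0 at the smallest k with P(X >= k | n, p=0.5) <= alpha; the
    power is the probability of reaching that k under the alternative p_alt.
    """
    # Find the smallest rejection threshold k. Note binom.sf(k - 1, n, p)
    # equals P(X >= k) for the binomial upper tail.
    for k in range(n_topics + 1):
        if binom.sf(k - 1, n_topics, 0.5) <= alpha:
            break
    else:
        return 0.0  # no achievable rejection region at this alpha
    # Power: probability of observing >= k topic wins under the alternative.
    return binom.sf(k - 1, n_topics, p_alt)

if __name__ == "__main__":
    # Compare the two designs discussed in the abstract: 25 topics vs. 192.
    for n in (25, 50, 100, 192):
        print(n, round(sign_test_power(n, p_alt=0.6), 3))

With an assumed per-topic win probability of 0.6, the sketch shows power rising steeply as topics are added, which is the intuition behind trading many judgments per topic for many lightly judged topics.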