While test collection construction is a time-consuming and expensive process, the true cost is amortized by reusing the collection over hundreds or thousands of experiments. Some of these experiments may involve systems that retrieve documents not judged during the initial construction phase, and some of these systems may be "hard" to evaluate: depending on which judgments are missing and which judged documents were retrieved, the experimenter's confidence in an evaluation could be very low. We propose two methods for quantifying the reusability of a test collection for evaluating new systems. The proposed methods provide simple yet highly effective tests for determining whether an existing set of judgments is useful for evaluating a new system. Empirical evaluations using TREC datasets confirm the usefulness of our proposed reusability measures. In particular, we show that our methods can reliably estimate confidence intervals that are indicative of collection reusability.
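To make the underlying idea concrete, the sketch below is a minimal, purely illustrative example of one way a reusability signal of this kind can be computed; it is not the paper's actual estimator. It bounds each topic's precision@k for a new system (pessimistically treating unjudged retrieved documents as non-relevant, optimistically as relevant) and bootstraps a confidence interval for the mean over topics. All names, the choice of precision@k, the percentile bootstrap, and the synthetic judgment pool are assumptions made for illustration.

```python
# Illustrative sketch (NOT the paper's method): bound precision@k under
# incomplete judgments and bootstrap a confidence interval on the mean.
# A wide interval suggests the existing judgments cover the new system
# poorly, i.e., the collection may not be reusable for it.
import random

def precision_bounds(ranking, judgments, k=10):
    """Return (lower, upper) precision@k given partial judgments.

    ranking   -- list of doc ids in retrieved order
    judgments -- dict doc_id -> 1 (relevant) or 0 (non-relevant);
                 docs absent from the dict are unjudged
    """
    top = ranking[:k]
    rel = sum(1 for d in top if judgments.get(d) == 1)
    unjudged = sum(1 for d in top if d not in judgments)
    return rel / k, (rel + unjudged) / k   # unjudged = 0 vs. unjudged = 1

def bootstrap_ci(per_topic_scores, trials=10000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the mean over topics."""
    n = len(per_topic_scores)
    means = sorted(
        sum(random.choices(per_topic_scores, k=n)) / n
        for _ in range(trials)
    )
    return means[int(alpha / 2 * trials)], means[int((1 - alpha / 2) * trials) - 1]

if __name__ == "__main__":
    random.seed(0)
    # Synthetic data: 50 topics, each with a ranking over 100 docs, of
    # which only a random 60-doc pool was judged (simulating a new
    # system run against a previously built collection).
    topics = []
    for _ in range(50):
        docs = [f"d{i}" for i in range(100)]
        random.shuffle(docs)
        pool = random.sample(docs, 60)
        judgments = {d: random.randint(0, 1) for d in pool}
        topics.append((docs, judgments))

    lower = [precision_bounds(r, j)[0] for r, j in topics]
    upper = [precision_bounds(r, j)[1] for r, j in topics]
    lo_ci = bootstrap_ci(lower)
    hi_ci = bootstrap_ci(upper)
    # If even the outer envelope [lo_ci[0], hi_ci[1]] is narrow, the
    # existing judgments are likely adequate for this system.
    print(f"pessimistic mean P@10 CI: ({lo_ci[0]:.3f}, {lo_ci[1]:.3f})")
    print(f"optimistic  mean P@10 CI: ({hi_ci[0]:.3f}, {hi_ci[1]:.3f})")
```

The pessimistic/optimistic bounds are the simplest way to expose how much an evaluation depends on missing judgments; a tighter analysis would instead model the probability of relevance for unjudged documents, as several of the estimation approaches in this literature do.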