Portable, reusable test collections are a vital part of research and development in information retrieval. Reusability is difficult to assess, however. The standard approach, in which judgment collection is simulated with groups of systems held out and the held-out systems are then evaluated, only works when a large set of relevance judgments is available to draw on during the simulation. As test collections are built over ever-larger corpora, sufficient judgments for such simulation experiments become less and less likely. We therefore propose a methodology for information retrieval experimentation that collects evidence for or against the reusability of a test collection while judgments are being made. Using this methodology along with the appropriate statistical analyses, researchers can estimate the reusability of a test collection while building it and apply "course corrections" if the collection does not seem to be reaching the desired level of reusability. We show the robustness of our design to inherent sources of variance, and describe an actual implementation of the framework for creating a large test collection.
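To make the standard hold-out simulation concrete, here is a minimal sketch in Python. The data layout (a run as a dict mapping topics to ranked document lists, qrels as dicts from topic to the set of relevant documents, systems grouped by contributing site) and every function name are illustrative assumptions, not taken from the paper.

```python
"""Sketch of leave-one-group-out reusability simulation (illustrative only)."""
from collections import defaultdict

def pool(runs, depth):
    """Union of the top-`depth` documents per topic over the given runs."""
    pooled = defaultdict(set)
    for run in runs:
        for topic, ranking in run.items():
            pooled[topic].update(ranking[:depth])
    return pooled

def average_precision(ranking, relevant):
    """Standard AP; unjudged documents are treated as nonrelevant."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def holdout_experiment(groups, qrels, depth=100):
    """For each group, rebuild qrels from the *other* groups' pools only,
    then score the held-out group's runs under reduced vs. full qrels."""
    results = defaultdict(list)
    for held_out, runs in groups.items():
        contributing = [r for g, rs in groups.items() if g != held_out for r in rs]
        pooled = pool(contributing, depth)
        for run in runs:
            for topic, ranking in run.items():
                full_rel = qrels[topic]
                reduced_rel = full_rel & pooled[topic]  # judgments the held-out group did not influence
                results[held_out].append(
                    (average_precision(ranking, full_rel),
                     average_precision(ranking, reduced_rel)))
    return results
```

A reusable collection would rank the held-out systems in nearly the same order under the reduced qrels as under the full qrels (e.g., a high rank correlation between the paired scores); the abstract's point is that this comparison requires the large judgment set that new, larger corpora make impractical, which is why it argues for gathering reusability evidence during judging instead.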