Portable, reusable test collections are a vital part of research and development in information retrieval. Reusability is difficult to assess, however. The standard approach, in which judgment collection is simulated with groups of systems held out and the held-out systems are then evaluated, only works when a large set of relevance judgments is available to draw on during the simulation. As test collections are built over ever-larger corpora, sufficient judgments for such simulation experiments become less and less likely. We therefore propose a methodology for information retrieval experimentation that collects evidence for or against the reusability of a test collection while judgments are being made. Using this methodology along with the appropriate statistical analyses, researchers can estimate the reusability of a test collection while building it and apply "course corrections" if the collection does not seem to be reaching the desired level of reusability. We show the robustness of our design to inherent sources of variance, and describe an actual implementation of the framework for creating a large test collection.
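To make the standard hold-out simulation concrete, here is a minimal sketch in Python. The data layout (a run as a dict mapping topics to ranked document lists, qrels as dicts from topic to the set of relevant documents, systems grouped by contributing site) and every function name are illustrative assumptions, not taken from the paper.

```python
"""Sketch of leave-one-group-out reusability simulation (illustrative only)."""
from collections import defaultdict

def pool(runs, depth):
    """Union of the top-`depth` documents per topic over the given runs."""
    pooled = defaultdict(set)
    for run in runs:
        for topic, ranking in run.items():
            pooled[topic].update(ranking[:depth])
    return pooled

def average_precision(ranking, relevant):
    """Standard AP; unjudged documents are treated as nonrelevant."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def holdout_experiment(groups, qrels, depth=100):
    """For each group, rebuild qrels from the *other* groups' pools only,
    then score the held-out group's runs under reduced vs. full qrels."""
    results = defaultdict(list)
    for held_out, runs in groups.items():
        contributing = [r for g, rs in groups.items() if g != held_out for r in rs]
        pooled = pool(contributing, depth)
        for run in runs:
            for topic, ranking in run.items():
                full_rel = qrels[topic]
                reduced_rel = full_rel & pooled[topic]  # judgments the held-out group did not influence
                results[held_out].append(
                    (average_precision(ranking, full_rel),
                     average_precision(ranking, reduced_rel)))
    return results
```

A reusable collection would rank the held-out systems in nearly the same order under the reduced qrels as under the full qrels (e.g., a high rank correlation between the paired scores); the abstract's point is that this comparison requires the large judgment set that new, larger corpora make impractical, which is why it argues for gathering reusability evidence during judging instead.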