Measuring the reusability of test collections
Proceedings of the third ACM international conference on Web search and data mining
Research in Information Retrieval has progressed against a background of rapidly increasing corpus size and heterogeneity, with every advance in technology quickly followed by a desire to organize and search bigger, more unstructured, and more heterogeneous corpora. But as retrieval problems grow larger and more complicated, evaluating the ranking performance of a retrieval engine becomes harder: evaluation requires human judgments of the relevance of documents to queries, and for very large corpora the cost of acquiring these judgments may be insurmountable. This cost limits the types of problems researchers can study, as well as the data on which they can study them.

We present methods for understanding performance differences between retrieval engines in the presence of missing and noisy relevance judgments. The work introduces a model of the cost of experimentation that incorporates both the cost of human judgments and the cost of drawing incorrect conclusions about differences between engines, in both the training and testing phases of engine development. By adopting a view of evaluation concerned with distributions over performance differences rather than with estimates of absolute performance, the expected cost can be minimized so as to reliably differentiate between engines with less than 1% of the human effort used in past experiments.
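The idea of reasoning about a distribution over a performance difference, rather than a point estimate of absolute performance, can be illustrated with a small Monte Carlo sketch. This is not the thesis's actual algorithm: the function names, the use of precision@k, and the uniform Bernoulli(0.5) prior on unjudged documents are all illustrative assumptions.

```python
import random


def delta_prec_at_k(run_a, run_b, judged, sampled, k=10):
    """Difference in precision@k between two rankings, filling in each
    unjudged document with one sampled relevance value (0 or 1)."""
    def prec(run):
        rel = sum(judged.get(d, sampled.get(d, 0)) for d in run[:k])
        return rel / k
    return prec(run_a) - prec(run_b)


def prob_a_beats_b(run_a, run_b, judged, k=10, n_samples=2000):
    """Estimate P(engine A outperforms engine B) given partial judgments.
    Unjudged documents are modelled as Bernoulli(0.5) -- an illustrative
    prior, not a calibrated relevance model."""
    unknown = {d for d in run_a[:k] + run_b[:k] if d not in judged}
    wins = 0
    for _ in range(n_samples):
        sampled = {d: int(random.random() < 0.5) for d in unknown}
        if delta_prec_at_k(run_a, run_b, judged, sampled, k) > 0:
            wins += 1
    return wins / n_samples
```

With no judgments at all, the probability sits near 0.5; as judgments accumulate it drifts toward 0 or 1. Judging could then proceed incrementally (e.g., one document at a time), stopping as soon as the probability leaves some interval such as [0.05, 0.95] -- at which point the sign of the difference is known with high confidence, often after far fewer judgments than exhaustive assessment would require.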