Assessing the relative performance of search systems requires a test collection with a predefined set of queries and corresponding relevance assessments. The state-of-the-art process for constructing test collections uses a large number of queries and, for each query, selects a set of documents to be judged from those submitted by a group of participating systems. However, the initial set of judgments may be insufficient to reliably evaluate the performance of future, previously unseen systems. In this paper, we propose a method that expands the set of relevance judgments as new systems are evaluated. We assume there is a limited budget for building additional relevance judgments. From the documents retrieved by the new systems, we create a pool of unjudged documents per query. Rather than distributing the budget uniformly across all queries, we first select a subset of queries that are effective in evaluating systems and then allocate the budget uniformly across only these queries. Experimental results on the TREC 2004 Robust track test collection demonstrate that this allocation strategy outperforms distributing the budget uniformly across all queries.
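To make the allocation strategy concrete, the following is a minimal Python sketch under stated assumptions: it pools the unjudged documents retrieved by the new systems per query, keeps a subset of queries, and splits the judgment budget uniformly across that subset. All names (allocate_budget, unjudged_pools, num_selected) and the pool-size scoring heuristic are illustrative assumptions; the paper's actual model for identifying effective queries is not reproduced here.

    def allocate_budget(queries, unjudged_pools, budget, num_selected):
        """Spend a judgment budget on a subset of 'effective' queries.

        queries        -- list of query ids
        unjudged_pools -- dict: query id -> list of unjudged doc ids,
                          pooled from the runs of the new systems
        budget         -- total number of additional judgments affordable
        num_selected   -- size of the selected query subset
        """
        # Placeholder effectiveness score: pool size. The paper instead
        # selects queries that are effective in evaluating systems.
        scored = sorted(queries, key=lambda q: len(unjudged_pools[q]),
                        reverse=True)
        selected = scored[:num_selected]

        # Uniform allocation across the selected queries only.
        per_query = budget // len(selected)

        judgments_to_make = {}
        for q in selected:
            judgments_to_make[q] = unjudged_pools[q][:per_query]
        return judgments_to_make

    if __name__ == "__main__":
        pools = {"q1": ["d1", "d2", "d3"], "q2": ["d4"], "q3": ["d5", "d6"]}
        plan = allocate_budget(["q1", "q2", "q3"], pools,
                               budget=4, num_selected=2)
        print(plan)  # {'q1': ['d1', 'd2'], 'q3': ['d5', 'd6']}

The design point the sketch illustrates is the contrast in the abstract: spending the same total budget on fewer, more informative queries yields deeper per-query pools than spreading it thinly over every query.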