Constructing large-scale test collections is costly and time-consuming, and a few relevance assessment methods have been proposed for building "minimal" information retrieval test collections that may still provide reliable experimental results. In contrast to building up such test collections, we take existing test collections constructed through the traditional pooling approach and empirically investigate whether they can be "boiled down." More specifically, we report on experiments with test collections from both NTCIR and TREC to investigate how reducing the topic set size and the pool depth affects the outcome of a statistical significance test between two systems, starting with (approximately) 100 topics and depth-100 pools. We define the cost (of manual relevance assessment) as the pool depth multiplied by the topic set size, and an error as a system pair whose statistical significance test outcome differs from the original result obtained with the full test collection. Our main findings are: (a) cost and the number of errors are negatively correlated, and any attempt at substantially reducing cost introduces some errors; (b) the NTCIR-7 IR4QA and the TREC 2004 robust track test collections all yield a comparable and considerable number of errors in response to cost reduction, even though the TREC relevance assessments relied on more than twice as many runs as the NTCIR ones; (c) using 100 topics with depth-30 pools generally yields fewer errors than using 30 topics with depth-100 pools; and (d) even with depth-100 pools, using fewer than 100 topics results in false alarms, i.e. two systems are declared significantly different even though the full topic set would not support that conclusion.
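The cost/error framework described above lends itself to a simple simulation. The sketch below is not the authors' code: the per-topic score lookup table, the run names, and the choice of a paired two-sided t-test at alpha = 0.05 are assumptions made purely for illustration. It counts, for a given topic subset and pool depth, how many system pairs flip their significance verdict relative to the full collection, with cost computed as pool depth multiplied by topic set size.

```python
# Hedged sketch: counting significance-test "errors" under reduced judgments.
# Assumptions (not from the paper): per-topic scores are precomputed for every
# (run, topic, pool depth) combination, and significance is tested with a
# paired two-sided t-test at alpha = 0.05.
import itertools

from scipy import stats

ALPHA = 0.05


def significant(scores_a, scores_b, alpha=ALPHA):
    """Paired two-sided t-test over the per-topic scores of two runs."""
    _, p_value = stats.ttest_rel(scores_a, scores_b)
    return p_value < alpha


def cost_and_errors(score, runs, full_topics, full_depth, topics, depth):
    """score[(run, topic, depth)] -> effectiveness score computed with a
    depth-`depth` pool (hypothetical lookup table). Returns the assessment
    cost (pool depth x topic set size) and the number of run pairs whose
    significance verdict disagrees with the full-collection verdict."""
    cost = depth * len(topics)
    errors = 0
    for run_a, run_b in itertools.combinations(runs, 2):
        full_verdict = significant(
            [score[(run_a, t, full_depth)] for t in full_topics],
            [score[(run_b, t, full_depth)] for t in full_topics],
        )
        reduced_verdict = significant(
            [score[(run_a, t, depth)] for t in topics],
            [score[(run_b, t, depth)] for t in topics],
        )
        if reduced_verdict != full_verdict:
            errors += 1
    return cost, errors


# Example mirroring finding (c): 100 topics with depth-30 pools versus
# 30 sampled topics (e.g. random.sample(all_topics, 30)) with depth-100 pools.
# cost1, errs1 = cost_and_errors(score, runs, all_topics, 100, all_topics, 30)
# cost2, errs2 = cost_and_errors(score, runs, all_topics, 100, sampled_30, 100)
```

Note that both reduced configurations in the commented example have the same nominal cost (3,000 judgments' worth of pooling per the depth-times-topics definition), which is what makes the comparison in finding (c) meaningful.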