The reliability of a test collection is proportional to the number of queries it contains. But building a collection with many queries is expensive, so researchers have to strike a balance between reliability and cost. Previous work on measuring test collection reliability relied on data-based approaches that contemplated random what-if scenarios and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative, founded on analysis of variance, that provides reliability indicators grounded in statistical theory. However, these reliability indicators are hard to interpret in practice because they do not correspond to well-known indicators like Kendall tau correlation. We empirically establish these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators and show that they are extremely dependent on the sample of systems and queries used, so much so that the number of queries required to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool for measuring test collection reliability. Reflecting on all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.
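To make the Generalizability Theory indicator concrete: the usual reliability statistic for relative comparisons is the generalizability coefficient Eρ², computed from variance components of a fully crossed systems × queries ANOVA. The following is a minimal Python sketch of that computation, not the paper's actual implementation; the function names (`g_study`, `e_rho2`, `queries_needed`) and the Spearman-Brown-style extrapolation to the required number of queries are illustrative assumptions.

```python
import numpy as np

def g_study(scores):
    """Estimate variance components from a fully crossed systems-by-queries
    matrix of effectiveness scores (two-way ANOVA without replication)."""
    n_s, n_q = scores.shape
    grand = scores.mean()
    s_means = scores.mean(axis=1)          # per-system means
    q_means = scores.mean(axis=0)          # per-query means
    ms_s = n_q * ((s_means - grand) ** 2).sum() / (n_s - 1)
    ms_q = n_s * ((q_means - grand) ** 2).sum() / (n_q - 1)
    resid = scores - s_means[:, None] - q_means[None, :] + grand
    ms_e = (resid ** 2).sum() / ((n_s - 1) * (n_q - 1))
    var_e = ms_e                           # system-query interaction + error
    var_s = max((ms_s - ms_e) / n_q, 0.0)  # system effect (floored at 0)
    var_q = max((ms_q - ms_e) / n_s, 0.0)  # query effect (floored at 0)
    return var_s, var_q, var_e

def e_rho2(var_s, var_e, n_q):
    """Generalizability coefficient for relative decisions (ranking
    stability) with n_q queries in the decision study."""
    return var_s / (var_s + var_e / n_q)

def queries_needed(var_s, var_e, target=0.95):
    """Queries required to reach a target E-rho^2, solving
    target = var_s / (var_s + var_e / n) for n."""
    return int(np.ceil((target / (1.0 - target)) * var_e / var_s))

# Toy example: 30 systems on 50 queries with synthetic AP-like scores,
# including a per-system offset so systems genuinely differ.
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(0.25, 0.1, size=(30, 50)) +
                 rng.normal(0.0, 0.05, size=(30, 1)), 0, 1)
vs, vq, ve = g_study(scores)
print(e_rho2(vs, ve, n_q=50), queries_needed(vs, ve, target=0.95))
```

Rerunning `g_study` on different subsamples of the systems or queries yields noticeably different variance component estimates, and hence very different outputs from `queries_needed`, which is one way to see the sample dependence, and the resulting case for confidence intervals, that the abstract raises.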