The reliability of a test collection is proportional to the number of queries it contains. But building a collection with many queries is expensive, so researchers have to strike a balance between reliability and cost. Previous work on measuring test collection reliability relied on data-based approaches that contemplated random what-if scenarios and provided indicators such as swap rates and Kendall tau correlations. Generalizability Theory was proposed as an alternative, founded on analysis of variance, that provides reliability indicators grounded in statistical theory. However, these reliability indicators are hard to interpret in practice because they do not correspond to well-known indicators like Kendall tau correlation. We empirically establish these relationships based on data from over 40 TREC collections, thus filling the gap in the practical interpretation of Generalizability Theory. We also review the computation of these indicators and show that they are extremely dependent on the sample of systems and queries used, so much so that the number of queries required to achieve a certain level of reliability can vary by orders of magnitude. We discuss the computation of confidence intervals for these statistics, providing a much more reliable tool for measuring test collection reliability. Reflecting on all these results, we review a wealth of TREC test collections, arguing that they are possibly not as reliable as generally accepted and that the common choice of 50 queries is insufficient even for stable rankings.
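To make the Generalizability Theory indicator concrete: the usual reliability statistic for relative comparisons is the generalizability coefficient Eρ², computed from variance components of a fully crossed systems × queries ANOVA. The following is a minimal Python sketch of that computation, not the paper's actual implementation; the function names (`g_study`, `e_rho2`, `queries_needed`) and the Spearman-Brown-style extrapolation to the required number of queries are illustrative assumptions.

```python
import numpy as np

def g_study(scores):
    """Estimate variance components from a fully crossed systems-by-queries
    matrix of effectiveness scores (two-way ANOVA without replication)."""
    n_s, n_q = scores.shape
    grand = scores.mean()
    s_means = scores.mean(axis=1)          # per-system means
    q_means = scores.mean(axis=0)          # per-query means
    ms_s = n_q * ((s_means - grand) ** 2).sum() / (n_s - 1)
    ms_q = n_s * ((q_means - grand) ** 2).sum() / (n_q - 1)
    resid = scores - s_means[:, None] - q_means[None, :] + grand
    ms_e = (resid ** 2).sum() / ((n_s - 1) * (n_q - 1))
    var_e = ms_e                           # system-query interaction + error
    var_s = max((ms_s - ms_e) / n_q, 0.0)  # system effect (floored at 0)
    var_q = max((ms_q - ms_e) / n_s, 0.0)  # query effect (floored at 0)
    return var_s, var_q, var_e

def e_rho2(var_s, var_e, n_q):
    """Generalizability coefficient for relative decisions (ranking
    stability) with n_q queries in the decision study."""
    return var_s / (var_s + var_e / n_q)

def queries_needed(var_s, var_e, target=0.95):
    """Queries required to reach a target E-rho^2, solving
    target = var_s / (var_s + var_e / n) for n."""
    return int(np.ceil((target / (1.0 - target)) * var_e / var_s))

# Toy example: 30 systems on 50 queries with synthetic AP-like scores,
# including a per-system offset so systems genuinely differ.
rng = np.random.default_rng(0)
scores = np.clip(rng.normal(0.25, 0.1, size=(30, 50)) +
                 rng.normal(0.0, 0.05, size=(30, 1)), 0, 1)
vs, vq, ve = g_study(scores)
print(e_rho2(vs, ve, n_q=50), queries_needed(vs, ve, target=0.95))
```

Rerunning `g_study` on different subsamples of the systems or queries yields noticeably different variance component estimates, and hence very different outputs from `queries_needed`, which is one way to see the sample dependence, and the resulting case for confidence intervals, that the abstract raises.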