The cost of a retrieval test collection, as well as its power and reliability, all grow with the number of topics it contains. Test collections created through community evaluations such as TREC generally use 50 topics. Prior work estimated the reliability of 50-topic sets by extrapolating confidence levels from smaller sets, and concluded that 50 topics are sufficient to support high confidence in a comparison, especially when the comparison is statistically significant. Using topic sets that actually contain 50 topics, this paper shows that statistically significant differences can nonetheless be wrong, even when significance is accompanied by a moderately large (10%) relative difference in scores. Further, using standardized evaluation scores rather than raw evaluation scores does not increase the reliability of these paired comparisons. Researchers should continue to be skeptical of conclusions demonstrated on only a single test collection.
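To make the kind of comparison described above concrete, the sketch below estimates how often a significant paired comparison on one 50-topic set is contradicted on a disjoint 50-topic set, for both raw and per-topic standardized scores. It is a minimal illustration under stated assumptions, not the paper's experimental code: the simulated per-topic scores, the 0.05 significance level, the noise model, and the z-score standardization against a pool of fake background runs are all illustrative stand-ins, and the abstract's 10% relative-difference filter is omitted because it is not well defined for z-scored values.

```python
"""Illustrative sketch (not the paper's code): how often do significant
paired comparisons on disjoint 50-topic sets contradict each other, and
does per-topic score standardization change the picture?  All scores
here are simulated; a real study would use per-topic effectiveness
scores (e.g., average precision) from actual TREC runs."""

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

N_TOPICS = 100   # split into two disjoint 50-topic sets
N_TRIALS = 2000  # simulated pairs of runs
ALPHA = 0.05     # significance level (assumption)

def standardize(scores, pool):
    """Per-topic z-scores against a pool of runs (pool: runs x topics);
    an illustrative stand-in for score standardization."""
    return (scores - pool.mean(axis=0)) / pool.std(axis=0, ddof=1)

def verdict(a, b):
    """+1 or -1 if a significant paired t-test favors a or b, else 0."""
    _, p = stats.ttest_rel(a, b)
    return int(np.sign(a.mean() - b.mean())) if p < ALPHA else 0

def conflict_count(score_pairs):
    """Count run pairs whose two disjoint 50-topic sets yield
    significant verdicts, and how many of those point opposite ways."""
    conflicts = significant = 0
    for a, b in score_pairs:
        v1 = verdict(a[:50], b[:50])   # first 50-topic set
        v2 = verdict(a[50:], b[50:])   # disjoint second set
        if v1 or v2:
            significant += 1
            conflicts += v1 * v2 < 0   # significant both ways, opposed
    return conflicts, significant

# Simulated per-topic scores: run B equals run A plus zero-mean noise,
# so the two runs are nearly equal in true effectiveness and any stable
# "winner" is largely an artifact of the topic sample.
pool = rng.beta(2, 5, (20, N_TOPICS))  # fake background runs for z-scores
pairs = []
for _ in range(N_TRIALS):
    a = rng.beta(2, 5, N_TOPICS)
    b = np.clip(a + rng.normal(0, 0.12, N_TOPICS), 0.0, 1.0)
    pairs.append((a, b))

raw = conflict_count(pairs)
std = conflict_count([(standardize(a, pool), standardize(b, pool))
                      for a, b in pairs])
print(f"raw scores:          {raw[0]} conflicts / {raw[1]} significant pairs")
print(f"standardized scores: {std[0]} conflicts / {std[1]} significant pairs")
```

Substituting real per-topic scores from TREC runs for the simulated arrays turns this harness into the question the abstract poses directly: how often does the significant verdict on one 50-topic set disagree with the verdict on the other, and does standardization reduce that rate?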