Topic set size redux

  • Author: Ellen M. Voorhees
  • Affiliation: National Institute of Standards and Technology, Gaithersburg, MD, USA
  • Venue: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval
  • Year: 2009


Abstract

The cost of a retrieval test collection, as well as its statistical power and reliability, all grow with the number of topics it contains. Test collections created through community evaluations such as TREC generally use 50 topics. Prior work estimated the reliability of 50-topic sets by extrapolating confidence levels from those of smaller sets, and concluded that 50 topics are sufficient for high confidence in a comparison, especially when the comparison is statistically significant. Using topic sets that actually contain 50 topics, this paper shows that statistically significant differences can be wrong, even when statistical significance is accompanied by a moderately large (10%) relative difference in scores. Further, using standardized evaluation scores rather than raw evaluation scores does not increase the reliability of these paired comparisons. Researchers should continue to be skeptical of conclusions demonstrated on only a single test collection.
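The paired comparisons the abstract refers to are typically run over per-topic effectiveness scores (e.g. average precision) for two systems on the same 50 topics. As a minimal sketch of that kind of test, the snippet below computes a paired t statistic and the relative difference in mean scores over synthetic per-topic data; the score values and system names are hypothetical, not drawn from the paper, and the point of the paper is precisely that such a significant result on a single 50-topic collection can still be misleading.

```python
import math
import random

def paired_t(scores_a, scores_b):
    """Paired t statistic over per-topic score differences (a - b)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-topic average-precision scores for two systems
# on a 50-topic collection (illustrative only).
random.seed(0)
sys_a = [random.uniform(0.1, 0.6) for _ in range(50)]
sys_b = [min(1.0, max(0.0, s + random.gauss(0.03, 0.05))) for s in sys_a]

t = paired_t(sys_b, sys_a)
rel_diff = (sum(sys_b) - sum(sys_a)) / sum(sys_a)
print(f"t = {t:.2f}, relative difference = {rel_diff:.1%}")
```

With 49 degrees of freedom, |t| above roughly 2.01 would be called significant at the 0.05 level; the paper's point is that clearing this bar on one collection, even together with a ~10% relative difference, does not guarantee the conclusion holds on another.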