The cost of a retrieval test collection, as well as its power and reliability, all grow with the number of topics it contains. Test collections created through community evaluations such as TREC generally use 50 topics. Prior work estimated the reliability of 50-topic sets by extrapolating confidence levels from smaller sets, and concluded that 50 topics are sufficient to support high confidence in a comparison, especially when the comparison is statistically significant. Using topic sets that actually contain 50 topics, this paper shows that statistically significant differences can nonetheless be wrong, even when significance is accompanied by a moderately large (10%) relative difference in scores. Further, using standardized evaluation scores rather than raw evaluation scores does not increase the reliability of these paired comparisons. Researchers should continue to be skeptical of conclusions demonstrated on only a single test collection.
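To make the kind of comparison described above concrete, the sketch below estimates how often a significant paired comparison on one 50-topic set is contradicted on a disjoint 50-topic set, for both raw and per-topic standardized scores. It is a minimal illustration under stated assumptions, not the paper's experimental code: the simulated per-topic scores, the 0.05 significance level, the noise model, and the z-score standardization against a pool of fake background runs are all illustrative stand-ins, and the abstract's 10% relative-difference filter is omitted because it is not well defined for z-scored values.

```python
"""Illustrative sketch (not the paper's code): how often do significant
paired comparisons on disjoint 50-topic sets contradict each other, and
does per-topic score standardization change the picture?  All scores
here are simulated; a real study would use per-topic effectiveness
scores (e.g., average precision) from actual TREC runs."""

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

N_TOPICS = 100   # split into two disjoint 50-topic sets
N_TRIALS = 2000  # simulated pairs of runs
ALPHA = 0.05     # significance level (assumption)

def standardize(scores, pool):
    """Per-topic z-scores against a pool of runs (pool: runs x topics);
    an illustrative stand-in for score standardization."""
    return (scores - pool.mean(axis=0)) / pool.std(axis=0, ddof=1)

def verdict(a, b):
    """+1 or -1 if a significant paired t-test favors a or b, else 0."""
    _, p = stats.ttest_rel(a, b)
    return int(np.sign(a.mean() - b.mean())) if p < ALPHA else 0

def conflict_count(score_pairs):
    """Count run pairs whose two disjoint 50-topic sets yield
    significant verdicts, and how many of those point opposite ways."""
    conflicts = significant = 0
    for a, b in score_pairs:
        v1 = verdict(a[:50], b[:50])   # first 50-topic set
        v2 = verdict(a[50:], b[50:])   # disjoint second set
        if v1 or v2:
            significant += 1
            conflicts += v1 * v2 < 0   # significant both ways, opposed
    return conflicts, significant

# Simulated per-topic scores: run B equals run A plus zero-mean noise,
# so the two runs are nearly equal in true effectiveness and any stable
# "winner" is largely an artifact of the topic sample.
pool = rng.beta(2, 5, (20, N_TOPICS))  # fake background runs for z-scores
pairs = []
for _ in range(N_TRIALS):
    a = rng.beta(2, 5, N_TOPICS)
    b = np.clip(a + rng.normal(0, 0.12, N_TOPICS), 0.0, 1.0)
    pairs.append((a, b))

raw = conflict_count(pairs)
std = conflict_count([(standardize(a, pool), standardize(b, pool))
                      for a, b in pairs])
print(f"raw scores:          {raw[0]} conflicts / {raw[1]} significant pairs")
print(f"standardized scores: {std[0]} conflicts / {std[1]} significant pairs")
```

Substituting real per-topic scores from TREC runs for the simulated arrays turns this harness into the question the abstract poses directly: how often does the significant verdict on one 50-topic set disagree with the verdict on the other, and does standardization reduce that rate?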