System effectiveness evaluation in a TREC-like environment is usually performed on a common set of topics. We show that a reliable evaluation can be obtained even when each system is evaluated on a different set of topics, and that reliability increases further with appropriate topic selection strategies and metric normalizations.
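As a rough illustration of the idea (a sketch, not the paper's actual method), the snippet below uses synthetic per-topic scores: each topic's scores are z-standardized across systems and mapped through the normal CDF, in the spirit of score standardization for inter-collection comparison; each system is then scored on its own random topic subset, and agreement with the full-topic ranking is measured with Kendall's tau. All names and numbers (`scores`, `subset_size`, the Beta-distributed data) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kendalltau, norm

rng = np.random.default_rng(42)

# Hypothetical per-topic effectiveness scores (e.g., average precision):
# rows are systems, columns are topics. Real scores would come from runs
# on a TREC-like collection; Beta(2, 5) noise stands in for them here.
n_systems, n_topics = 20, 50
scores = rng.beta(2.0, 5.0, size=(n_systems, n_topics))

def standardize(scores: np.ndarray) -> np.ndarray:
    """Z-score each topic across systems, then squash to [0, 1] with the
    normal CDF, in the spirit of Webber et al.'s score standardization."""
    mu = scores.mean(axis=0, keepdims=True)
    sigma = scores.std(axis=0, keepdims=True) + 1e-12  # avoid divide-by-zero
    return norm.cdf((scores - mu) / sigma)

std_scores = standardize(scores)

# Evaluate each system on its own random subset of topics (different
# topics per system), averaging its standardized scores over the subset.
subset_size = 10
subset_means = np.array([
    std_scores[i, rng.choice(n_topics, size=subset_size, replace=False)].mean()
    for i in range(n_systems)
])

# Reliability proxy: rank correlation between the subset-based ranking
# and the ranking obtained from the full topic set.
full_means = std_scores.mean(axis=1)
tau, _ = kendalltau(full_means, subset_means)
print(f"Kendall's tau vs. full-topic ranking: {tau:.2f}")
```

Standardization factors out topic-specific difficulty, so averages taken over different topic subsets stay on a comparable scale; replacing the random subsets with a deliberate topic selection strategy is where the abstract claims further reliability gains.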