Statistical inference in retrieval effectiveness evaluation
Information Processing and Management: an International Journal
How reliable are the results of large-scale information retrieval experiments?
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating evaluation measure stability
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Evaluation by highly relevant documents
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Information Retrieval
Modern Information Retrieval
Cumulated gain-based evaluation of IR techniques
ACM Transactions on Information Systems (TOIS)
The Philosophy of Information Retrieval Evaluation
CLEF '01 Revised Papers from the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-Language Information Retrieval Systems
Scaling IR-system evaluation using term relevance sets
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Retrieval evaluation with incomplete information
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Combining eye movements and collaborative filtering for proactive information retrieval
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Accurately interpreting clickthrough data as implicit feedback
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Information retrieval system evaluation: effort, sensitivity, and reliability
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
A statistical method for system evaluation using incomplete judgments
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating average precision with incomplete and imperfect judgments
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Exploring cost-effective approaches to human evaluation of search engine relevance
ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research
A simple and efficient sampling method for estimating AP and NDCG
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Comparing metrics across TREC and NTCIR: the robustness to system bias
Proceedings of the 17th ACM conference on Information and knowledge management
Measuring the Search Effectiveness of a Breadth-First Crawl
ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval
Foundations and Trends in Information Retrieval
CLEF'10 Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum
Evaluation effort, reliability and reusability in XML retrieval
Journal of the American Society for Information Science and Technology
Hi-index | 0.00 |
We investigate the robustness of three widely used IR relevance measures for large data collections with incomplete judgments. The relevance measures we consider are the bpref measure introduced by Buckley and Voorhees [7], the inferred average precision (infAP) introduced by Aslam and Yilmaz [4], and the normalized discounted cumulative gain (NDCG) measure introduced by Järvelin and Kekäläinen [8]. Our main results show that NDCG consistently performs better than both bpref and infAP. The experiments are performed on standard TREC datasets, under different levels of incompleteness of judgments, and using two different evaluation methods, namely, the Kendall correlation measures order between system rankings and pairwise statistical significance testing; the latter may be of independent interest.