Various measures, such as binary preference (bpref), inferred average precision (infAP), and binary normalised discounted cumulative gain (nDCG), have been proposed as alternatives to mean average precision (MAP) because they are less sensitive to the completeness of the relevance judgements. Since the primary aim of building a system is to train it to respond to user queries in a robust and stable manner, in this paper we investigate how important the choice of evaluation measure is for training under different levels of evaluation incompleteness. We simulate evaluation incompleteness by sampling from the relevance assessments. Through large-scale experiments on two standard TREC test collections, we examine retrieval sensitivity during training, i.e. whether a training process based on any of the four discussed measures affects the final retrieval performance. Experimental results show that training with bpref, infAP and nDCG yields significantly better retrieval performance than training with MAP when the completeness of the relevance judgements is extremely low; as completeness increases, the measures behave more similarly.
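To make the sampling step concrete, the following is a minimal sketch of how evaluation incompleteness could be simulated by randomly keeping only a fraction of TREC-style relevance assessments per topic. The file format, function names (load_qrels, sample_qrels) and the per-topic uniform sampling are illustrative assumptions, not the paper's actual implementation.

    import random
    from collections import defaultdict

    def load_qrels(path):
        # Parse TREC-style qrels lines: "topic_id iteration doc_id relevance".
        qrels = defaultdict(dict)
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) != 4:
                    continue
                topic, _, doc, rel = parts
                qrels[topic][doc] = int(rel)
        return qrels

    def sample_qrels(qrels, fraction, seed=42):
        # Keep a random subset of the judged documents for each topic,
        # simulating relevance judgements at a given completeness level.
        rng = random.Random(seed)
        sampled = {}
        for topic, docs in qrels.items():
            doc_ids = list(docs)
            k = max(1, int(round(len(doc_ids) * fraction)))
            kept = rng.sample(doc_ids, k)
            sampled[topic] = {d: docs[d] for d in kept}
        return sampled

    # Example (hypothetical file name): build qrels subsets at several
    # completeness levels and train/evaluate against each subset.
    # full = load_qrels("qrels.txt")
    # for frac in (0.01, 0.05, 0.1, 0.5, 1.0):
    #     subset = sample_qrels(full, frac)

Under this kind of setup, the training measure (MAP, bpref, infAP or nDCG) would then be computed against each sampled subset rather than the full assessments, which is what allows the effect of incompleteness on training to be isolated.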