IR evaluation methods for retrieving highly relevant documents. SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval.
The maximum entropy method for analyzing retrieval measures. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval.
Information retrieval system evaluation: effort, sensitivity, and reliability. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval.
Learning to rank using gradient descent. ICML '05: Proceedings of the 22nd international conference on Machine learning.
A support vector method for multivariate performance measures. ICML '05: Proceedings of the 22nd international conference on Machine learning.
Optimisation methods for ranking functions with multiple parameters. CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management.
On rank-based effectiveness measures and optimization. Information Retrieval.
A support vector method for optimizing average precision. SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval.
AdaRank: a boosting algorithm for information retrieval. SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval.
SoftRank: optimizing non-smooth rank metrics. WSDM '08: Proceedings of the 2008 International Conference on Web Search and Data Mining.
A new interpretation of average precision. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval.
Precision-at-ten considered redundant. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval.
Estimating average precision when judgments are incomplete. Knowledge and Information Systems.
AUC: a statistically consistent and more discriminating measure than accuracy. IJCAI '03: Proceedings of the 18th international joint conference on Artificial intelligence.
Gradient descent optimization of smoothed information retrieval metrics. Information Retrieval.
On statistical analysis and optimization of information retrieval effectiveness metrics. Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval.
On the informativeness of cascade and intent-aware effectiveness measures. Proceedings of the 20th international conference on World Wide Web.
On the suitability of diversity metrics for learning-to-rank for diversity. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval.
On smoothing average precision. ECIR '12: Proceedings of the 34th European conference on Advances in Information Retrieval.
Robust query rewriting using anchor data. Proceedings of the sixth ACM international conference on Web search and data mining.
Two-stage learning to rank for information retrieval. ECIR '13: Proceedings of the 35th European conference on Advances in Information Retrieval.
The whens and hows of learning to rank for web search. Information Retrieval.
Most current machine learning methods for building search engines assume that there is a target evaluation metric that measures the quality of the engine with respect to the end user, and that the engine should be trained to optimize that metric. Treating the target metric as given, many approaches (e.g. LambdaRank, SoftRank, RankingSVM) have been proposed for optimizing retrieval metrics. The target metric used in optimization acts as a bottleneck that summarizes the training data, and it is known that some evaluation metrics are more informative than others. In this paper, we consider the effect of the target evaluation metric on learning to rank. In particular, we question the assumption that retrieval systems should be trained to directly optimize the metric that is assumed to measure user satisfaction. We show that even if user satisfaction can be measured by a metric X, optimizing the engine on the training set for a more informative metric Y may yield better test performance according to X than optimizing the engine directly for X. We also analyze when the difference between the two approaches is significant, in terms of the amount of available training data and the dimensionality of the feature space.
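To make the train-for-Y, test-on-X protocol concrete, the following is a minimal, self-contained sketch (not code from the paper) that fits a linear ranker on synthetic data by directly optimizing either the target metric X (precision at 10) or a more informative metric Y (average precision), and then compares test performance measured by X. The data generator, the random-search optimizer, and all names are illustrative assumptions; random search is used only because it can maximize any non-smooth rank metric directly, standing in for a real learning-to-rank method such as SoftRank or LambdaRank.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 5
W_TRUE = rng.normal(size=N_FEATURES)  # hidden "true" relevance model (assumption)

def precision_at_k(labels_in_rank_order, k=10):
    # Metric X: fraction of relevant documents in the top k positions.
    return float(labels_in_rank_order[:k].mean())

def average_precision(labels_in_rank_order):
    # Metric Y: mean of the precision values at the ranks of relevant documents.
    rel_ranks = np.flatnonzero(labels_in_rank_order)
    if rel_ranks.size == 0:
        return 0.0
    precisions = np.arange(1, rel_ranks.size + 1) / (rel_ranks + 1)
    return float(precisions.mean())

def make_queries(n_queries, n_docs=50):
    # Synthetic queries: binary relevance loosely tied to the hidden model plus noise.
    queries = []
    for _ in range(n_queries):
        X = rng.normal(size=(n_docs, N_FEATURES))
        y = (X @ W_TRUE + rng.normal(scale=2.0, size=n_docs) > 1.0).astype(float)
        queries.append((X, y))
    return queries

def evaluate(w, queries, metric):
    # Mean metric value over queries, ranking documents by the linear score X @ w.
    vals = []
    for X, y in queries:
        order = np.argsort(-(X @ w))
        vals.append(metric(y[order]))
    return float(np.mean(vals))

def fit_by_random_search(train_queries, metric, n_trials=300):
    # Crude direct optimization of a non-smooth rank metric: keep the best random
    # weight vector found on the training set.
    best_w, best_val = None, -1.0
    for _ in range(n_trials):
        w = rng.normal(size=N_FEATURES)
        val = evaluate(w, train_queries, metric)
        if val > best_val:
            best_w, best_val = w, val
    return best_w

if __name__ == "__main__":
    train, test = make_queries(20), make_queries(200)
    p_at_10 = lambda labels: precision_at_k(labels, k=10)

    w_x = fit_by_random_search(train, p_at_10)             # train directly for X
    w_y = fit_by_random_search(train, average_precision)   # train for the more informative Y

    print("test P@10 when trained on P@10:", evaluate(w_x, test, p_at_10))
    print("test P@10 when trained on AP:  ", evaluate(w_y, test, p_at_10))
```

Varying the number of training queries and N_FEATURES in this sketch mirrors the two factors analyzed in the paper: the amount of training data and the dimensionality of the feature space.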