Extending average precision to graded relevance judgments

  • Authors:
  • Stephen E. Robertson (Microsoft Research, Cambridge, United Kingdom)
  • Evangelos Kanoulas (University of Sheffield, Sheffield, United Kingdom)
  • Emine Yilmaz (Microsoft Research, Cambridge, United Kingdom)

  • Venue:
  • Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2010

Abstract

Evaluation metrics play a critical role both in the comparative evaluation of retrieval system performance and, in learning-to-rank (LTR), as objective functions to be optimized. Many different evaluation metrics have been proposed in the IR literature, with average precision (AP) being the dominant one due to a number of desirable properties it possesses. However, most of these measures, including average precision, do not incorporate graded relevance. In this work, we propose a new measure of retrieval effectiveness, the Graded Average Precision (GAP). GAP generalizes average precision to the case of multi-graded relevance and inherits all the desirable characteristics of AP: it has a nice probabilistic interpretation, it approximates the area under a graded precision-recall curve, and it can be justified in terms of a simple but moderately plausible user model. We then evaluate GAP in terms of its informativeness and discriminative power. Finally, we show that GAP can reliably be used as an objective metric in learning to rank: optimizing for GAP with SoftRank and LambdaRank leads to better-performing ranking functions than those constructed by algorithms tuned to optimize for AP or NDCG, even when AP or NDCG is used as the test metric.
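To make the graded-threshold idea behind the abstract concrete, here is a minimal sketch (in Python) of one plausible reading of the user model: a user draws a relevance-grade threshold from some distribution, documents at or above that threshold are treated as binary-relevant, and binary average precision is computed for that binarization and averaged over thresholds. This is only an illustration of the user model described above, not the paper's exact closed-form definition of GAP; the function names, the example ranking, and the threshold distribution `g` are hypothetical.

```python
def binary_ap(binary_rels):
    """Standard average precision for a ranked list of 0/1 relevance labels."""
    total_relevant = sum(binary_rels)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(binary_rels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant rank
    return precision_sum / total_relevant


def expected_ap_over_thresholds(graded_rels, threshold_probs):
    """Illustrative only: expectation of binary AP when the user's grade
    threshold t is drawn with probability threshold_probs[t]; a document
    counts as relevant iff its grade >= t.  This sketches the graded user
    model, not necessarily the paper's exact GAP formula."""
    return sum(
        p * binary_ap([int(grade >= t) for grade in graded_rels])
        for t, p in threshold_probs.items()
    )


if __name__ == "__main__":
    # Hypothetical ranked list with grades 0 (nonrelevant), 1, and 2.
    ranking = [2, 0, 1, 2, 0, 1]
    # Hypothetical distribution over the user's grade threshold.
    g = {1: 0.5, 2: 0.5}
    print(round(expected_ap_over_thresholds(ranking, g), 4))
```

With an all-or-nothing threshold distribution (all mass on grade 1), the sketch reduces to ordinary binary AP, which mirrors the abstract's claim that GAP generalizes AP.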