Predictive model performance: offline and online evaluations

Authors:
Jeonghee Yi;Ye Chen;Jie Li;Swaraj Sett;Tak W. Yan
Affiliations:
Microsoft Corp., Mountain View, CA, USA;Microsoft Corp., Mountain View, CA, USA;Microsoft Corp., Mountain View, CA, USA;Microsoft Corp., Mountain View, CA, USA;Microsoft Corp., Mountain View, CA, USA
Venue:
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2013


Abstract

We study the accuracy of evaluation metrics used to estimate the efficacy of predictive models. Offline evaluation metrics are indicators of the expected model performance on real data. In practice, however, we often observe a substantial discrepancy between the offline and online performance of the models. We investigate the characteristics and behavior of evaluation metrics in offline and online testing, both analytically and empirically, by experimenting with them on online advertising data from the Bing search engine. One of our findings is that some offline metrics, such as AUC (the Area Under the Receiver Operating Characteristic Curve) and RIG (Relative Information Gain), which summarize model performance over the entire spectrum of operating points, can be quite misleading and can result in a significant discrepancy between offline and online metrics. For example, for click prediction models for search advertising, prediction errors in the very low range of predicted click scores hurt online performance much more than errors in other regions. Most of the offline metrics we studied, including AUC and RIG, are nevertheless insensitive to such model behavior. We designed a new model evaluation paradigm that simulates the online behavior of predictive models: for a set of ads selected by a new prediction model, the online user behavior is estimated from the historical user behavior in the search logs. The experimental results on click prediction models for search advertising are highly promising.
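
To make the two offline metrics named in the abstract concrete, here is a minimal Python sketch. It is not the authors' code. It assumes the common definitions: AUC as the probability that a randomly chosen clicked impression is scored above a randomly chosen non-clicked one, and RIG as one minus the ratio of the model's log loss to the log loss of always predicting the empirical click rate. The function names and the toy data are hypothetical.

```python
import math


def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in input size; fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def log_loss(labels, probs, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def rig(labels, probs):
    """Relative Information Gain (assumed definition): 1 - model loss / baseline loss.

    The baseline always predicts the empirical click rate of the data.
    """
    base_rate = sum(labels) / len(labels)
    baseline = log_loss(labels, [base_rate] * len(labels))
    return 1.0 - log_loss(labels, probs) / baseline


# Hypothetical toy data: 1 = click, 0 = no click.
labels = [0, 0, 0, 1, 0, 1, 0, 1]
probs = [0.02, 0.05, 0.10, 0.20, 0.30, 0.60, 0.40, 0.80]

# A strictly monotone rescaling that inflates the low scores the most.
inflated = [math.sqrt(p) for p in probs]

print("AUC original:", auc(labels, probs))
print("AUC inflated:", auc(labels, inflated))
print("RIG original:", rig(labels, probs))
print("RIG inflated:", rig(labels, inflated))
```

Because AUC depends only on the ordering of the scores, the rescaled predictions receive the same AUC as the originals even though every low-range probability has been pushed upward. RIG is computed from the probability values themselves, so it changes. This illustrates only the rank-based nature of AUC, not the paper's experiments.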