Sampling dilemma: towards effective data sampling for click prediction in sponsored search

Authors:
Jun Feng;Jiang Bian;Taifeng Wang;Wei Chen;Xiaoyan Zhu;Tie-Yan Liu
Affiliations:
Tsinghua University, Beijing, China;Microsoft Research, Beijing, China;Microsoft Research, Beijing, China;Microsoft Research, Beijing, China;Tsinghua University, Beijing, China;Microsoft Research, Beijing, China
Venue:
Proceedings of the 7th ACM international conference on Web search and data mining
Year:
2014

Abstract

Accurately predicting the probability that a user clicks on an ad plays a key role in sponsored search. State-of-the-art sponsored search systems typically use machine learning for click prediction. While paying much attention to extracting useful features and building effective models, previous studies have overlooked a less obvious but important challenge: data sampling. To fulfill the learning objective of click prediction, the sampled training data must not only follow an input distribution similar to that of the real-world data, but also yield a conditional output distribution, i.e., click-through rate (CTR), consistent with the real-world data. However, because clicks are sparse in sponsored search, these two requirements tend to conflict. In this paper, we first present a theoretical analysis that reveals this sampling dilemma, followed by a thorough data analysis demonstrating that straightforward random sampling may not balance the two kinds of consistency. To address this problem, we propose a new sampling algorithm that retains the consistency between the sampled data and the real world in terms of both input distribution and conditional output distribution. Large-scale evaluations on click-through logs from a commercial search engine demonstrate that the new sampling algorithm effectively addresses the sampling dilemma. Further experiments show that a model trained on data obtained by our sampling algorithm achieves much higher accuracy in click prediction.
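
The two kinds of consistency named in the abstract can be made concrete with a small sketch. The Python below is an illustration only and is not the sampling algorithm proposed in the paper: it builds a synthetic impression log, draws a uniform random subsample, and measures (a) how far the sample's input distribution over queries is from the full log and (b) how far the per-query CTR in the sample is from the full log. All names (`simulate_log`, `total_variation`, `mean_abs_ctr_error`), the toy query set, and the CTR values are assumptions made for this example.

```python
import random
from collections import Counter

def simulate_log(n_impressions, ctr_by_query, weights, seed=0):
    """Build a synthetic log of (query, clicked) pairs (hypothetical data)."""
    rng = random.Random(seed)
    queries = list(ctr_by_query)
    log = []
    for _ in range(n_impressions):
        q = rng.choices(queries, weights=weights)[0]
        log.append((q, rng.random() < ctr_by_query[q]))
    return log

def input_distribution(log):
    """Empirical P(query) over the impressions in the log."""
    counts = Counter(q for q, _ in log)
    return {q: c / len(log) for q, c in counts.items()}

def ctr_per_query(log):
    """Empirical P(click | query), i.e. the per-query CTR."""
    shown, clicked = Counter(), Counter()
    for q, c in log:
        shown[q] += 1
        clicked[q] += int(c)
    return {q: clicked[q] / shown[q] for q in shown}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def mean_abs_ctr_error(full_ctr, sample_ctr):
    """Mean absolute CTR gap over queries present in both logs."""
    shared = set(full_ctr) & set(sample_ctr)
    if not shared:
        return 0.0
    return sum(abs(full_ctr[k] - sample_ctr[k]) for k in shared) / len(shared)

if __name__ == "__main__":
    # Toy setting: a few head queries, many tail queries, sparse clicks.
    ctr = {f"q{i}": 0.02 for i in range(50)}
    weights = [1.0 / (i + 1) for i in range(50)]  # long-tailed traffic
    full = simulate_log(20000, ctr, weights, seed=1)
    sample = random.Random(2).sample(full, 1000)  # uniform random sampling

    tv = total_variation(input_distribution(full), input_distribution(sample))
    err = mean_abs_ctr_error(ctr_per_query(full), ctr_per_query(sample))
    print(f"input-distribution TV distance: {tv:.4f}")
    print(f"mean abs per-query CTR error:   {err:.4f}")
```

In a setup like this, a uniform sample is drawn from the same traffic mix, so its input distribution can be expected to stay close to the full log, while tail queries receive so few sampled impressions that their estimated CTR is often far from the full-log value. That is one way to read the tension the abstract describes. The numbers printed depend entirely on the assumed toy parameters.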