Sampling dilemma: towards effective data sampling for click prediction in sponsored search

  • Authors:
  • Jun Feng;Jiang Bian;Taifeng Wang;Wei Chen;Xiaoyan Zhu;Tie-Yan Liu

  • Affiliations:
  • Tsinghua University, Beijing, China;Microsoft Research, Beijing, China;Microsoft Research, Beijing, China;Microsoft Research, Beijing, China;Tsinghua University, Beijing, China;Microsoft Research, Beijing, China

  • Venue:
  • Proceedings of the 7th ACM international conference on Web search and data mining
  • Year:
  • 2014

Quantified Score

Hi-index 0.00

Visualization

Abstract

Precise prediction of the probability that users click on ads plays a key role in sponsored search. State-of-the-art sponsored search systems typically employ a machine learning approach to conduct click prediction. While paying much attention to extracting useful features and building effective models, previous studies have overshadowed seemingly less obvious but essentially important challenges in terms of data sampling. To fulfill the learning objective of click prediction, it is not only necessary to ensure that the sampled training data implies the similar input distribution compared with the real world one, but also to guarantee that the sampled training data yield the consistent conditional output distribution, i.e. click-through rate (CTR), with the real world data. However, due to the sparseness of clicks in sponsored search, it is a bit contradictory to address these two challenges simultaneously. In this paper, we first take a theoretical analysis to reveal this sampling dilemma, followed by a thorough data analysis which demonstrates that the straightforward random sampling method may not be effective to balance these two kinds of consistency in sampling dilemma simultaneously. To address this problem, we propose a new sampling algorithm which can succeed in retaining the consistency between the sampled data and real world in terms of both input distribution and conditional output distribution. Large scale evaluations on the click-through logs from a commercial search engine demonstrate that this new sampling algorithm can effectively address the sampling dilemma. Further experiments illustrate that, by using the training data obtained by our new sampling algorithm, we can learn the model with much higher accuracy in click prediction.