Audience segment expansion using distributed in-database k-means clustering

  • Authors:
  • Archana Ramesh (nPario Inc., Redmond, WA); Ankur Teredesai (University of Washington); Ashish Bindra (nPario Inc., Redmond, WA); Sreenivasulu Pokuri (nPario Inc., Redmond, WA); Krishna Uppala (nPario Inc., Redmond, WA)

  • Venue:
  • Proceedings of the Seventh International Workshop on Data Mining for Online Advertising
  • Year:
  • 2013

Abstract

Online display advertisers rely heavily on user segments to group users into targetable audiences. When a segment is smaller than a campaign budget requires, probabilistic modeling is needed to expand it, a process termed look-alike modeling. Given the multitude of data providers and online data sources, there are thousands of segments for each targetable consumer, extracted from billions of online (and even offline) actions performed by millions of users. Most advertisers, marketers, and publishers must use large-scale distributed infrastructures to create thousands of user segments daily. Developing accurate data mining models efficiently within such platforms is a challenging task. The volume and variety of data can be a significant bottleneck for non-disk-resident algorithms, since the running time for training and scoring hundreds of segments with millions of targetable users is non-trivial. In this paper, we present a novel k-means-based distributed in-database algorithm for look-alike modeling, implemented within the nPario database system. We demonstrate the utility of the algorithm: it is accurate, invariant to the size and skew of the targetable audience (very few positive examples), and scales linearly with the capacity and number of nodes in the distributed environment. To the best of our knowledge, this is the first commercially deployed distributed look-alike modeling implementation to solve this problem. We compare the performance of our algorithm with other distributed and non-distributed look-alike modeling techniques and report results over a multi-core environment.
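
The abstract does not spell out the algorithm, so the following is only a minimal single-machine sketch of how k-means could drive look-alike expansion, not the authors' distributed in-database implementation. It assumes each user is a numeric feature vector, clusters all users, and ranks non-seed users by the share of seed (positive) users in their cluster. The function names (`kmeans`, `expand_segment`), the farthest-first initialization, and the seed-density scoring rule are all illustrative assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point (ties go to the lower index)."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with deterministic farthest-first initialization.

    Assumption: the initialization is chosen for reproducibility of this
    sketch; the source does not say how centroids are seeded.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        far = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iterations):
        assignments = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(p, centroids) for p in points]


def expand_segment(features, seed_ids, k, target_size):
    """Return non-seed user ids to add so the segment reaches target_size.

    features: dict of user id -> feature vector (hypothetical input format).
    seed_ids: set of user ids already in the segment (positive examples).
    Scoring rule (an assumption): a user's score is the fraction of seed
    users in the cluster the user falls into.
    """
    ids = sorted(features)
    points = [features[u] for u in ids]
    _, assignments = kmeans(points, k)

    totals, seeds = [0] * k, [0] * k
    for u, a in zip(ids, assignments):
        totals[a] += 1
        if u in seed_ids:
            seeds[a] += 1
    density = [s / t if t else 0.0 for s, t in zip(seeds, totals)]

    candidates = [(density[a], u) for u, a in zip(ids, assignments)
                  if u not in seed_ids]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    needed = max(0, target_size - len(seed_ids))
    return [u for _, u in candidates[:needed]]


if __name__ == "__main__":
    # Toy data: two well-separated groups of five users each.
    features = {
        "u0": (0.0, 0.0), "u1": (0.0, 1.0), "u2": (1.0, 0.0),
        "u3": (1.0, 1.0), "u4": (0.5, 0.5),
        "u5": (10.0, 10.0), "u6": (10.0, 11.0), "u7": (11.0, 10.0),
        "u8": (11.0, 11.0), "u9": (10.5, 10.5),
    }
    print(expand_segment(features, {"u0", "u1"}, k=2, target_size=5))
```

In this toy run the two seed users fall in one cluster, so the remaining members of that cluster are the ones proposed for the expanded segment. A deployment of the kind the abstract describes would instead run the assignment and centroid-update steps inside the database across nodes, which this sketch does not attempt.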