Audience segment expansion using distributed in-database k-means clustering

Authors:
Archana Ramesh;Ankur Teredesai;Ashish Bindra;Sreenivasulu Pokuri;Krishna Uppala
Affiliations:
nPario Inc., Redmond, WA;University of Washington;nPario Inc., Redmond, WA;nPario Inc., Redmond, WA;nPario Inc., Redmond, WA
Venue:
Proceedings of the Seventh International Workshop on Data Mining for Online Advertising
Year:
2013

Abstract

Online display advertisers extensively use the concept of a user segment to cluster users into targetable groups. When a segment is smaller than the size a campaign budget calls for, probabilistic modeling is needed to expand it. This process is termed look-alike modeling. Given the multitude of data providers and online data sources, there are thousands of segments for each targetable consumer, extracted from billions of online (and even offline) actions performed by millions of users. The majority of advertisers, marketers and publishers have to use large-scale distributed infrastructures to create thousands of user segments on a daily basis. Developing accurate data mining models efficiently within such platforms is a challenging task. The volume and variety of data can be a significant bottleneck for non-disk-resident algorithms, since the operating time for training and scoring hundreds of segments with millions of targetable users is non-trivial. In this paper, we present a novel k-means-based distributed in-database algorithm for look-alike modeling, implemented within the nPario database system. We demonstrate the utility of the algorithm: it is accurate, invariant to the size and skew of the targetable audience (very few positive examples), and linearly dependent on the capacity and number of nodes in the distributed environment. To the best of our knowledge, this is the first commercially deployed distributed look-alike modeling implementation to solve this problem. We compare the performance of our algorithm with other distributed and non-distributed look-alike modeling techniques, and report the results over a multi-core environment.
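
The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general idea of k-means-based segment expansion, not the nPario implementation. It runs in a single process in memory, whereas the paper's algorithm is distributed and in-database. The function names (`kmeans`, `expand_segment`), the evenly spaced centroid initialization, and the scoring rule (rank non-segment users by the fraction of seed users in their cluster) are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means on tuples of floats; returns (centroids, assignments).

    Assumption: centroids start at evenly spaced input points so that the
    sketch is deterministic; production systems usually randomize this.
    """
    centroids = [points[i * len(points) // k] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        assign = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: mean of each cluster (an empty cluster keeps its centroid).
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assign


def expand_segment(points, seed_ids, k, target_size):
    """Hypothetical look-alike expansion: grow the seed segment to target_size.

    Each non-seed user is scored by the share of seed users in its cluster,
    and the highest-scoring users are added first.
    """
    _, assign = kmeans(points, k)
    seeds = set(seed_ids)
    size = [0] * k
    hits = [0] * k
    for i, c in enumerate(assign):
        size[c] += 1
        if i in seeds:
            hits[c] += 1
    candidates = [i for i in range(len(points)) if i not in seeds]
    candidates.sort(key=lambda i: hits[assign[i]] / size[assign[i]], reverse=True)
    needed = max(0, target_size - len(seeds))
    return sorted(seeds) + candidates[:needed]
```

For example, with eight users forming two well-separated groups in a hypothetical two-dimensional feature space and two seed users in the first group, `expand_segment(points, [0, 1], k=2, target_size=4)` fills the segment with the remaining users of that first group before considering anyone from the second.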