Overcoming browser cookie churn with clustering

Authors:
Anirban Dasgupta;Maxim Gurevich;Liang Zhang;Belle Tseng;Achint O. Thomas
Affiliations:
Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA;Yahoo! Research, Santa Clara, CA, USA
Venue:
Proceedings of the fifth ACM international conference on Web search and data mining
Year:
2012

Citing 19
Cited 1

Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
OPTICS: ordering points to identify the clustering structure

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Fast and effective text mining using linear-time document clustering

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Combining collaborative filtering with personal agents for better recommendations

AAAI '99/IAAI '99 Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
On Clustering Validation Techniques

Journal of Intelligent Information Systems
Support Vector Machines

IEEE Intelligent Systems
Constrained K-means Clustering with Background Knowledge

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Entity-based cross-document coreferencing using the Vector Space Model

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Algorithmic Graph Theory and Perfect Graphs (Annals of Discrete Mathematics, Vol 57)

Algorithmic Graph Theory and Perfect Graphs (Annals of Discrete Mathematics, Vol 57)
Correlation Clustering

Machine Learning
Probability and Computing: Randomized Algorithms and Probabilistic Analysis

Probability and Computing: Randomized Algorithms and Probabilistic Analysis
Naïve filterbots for robust cold-start recommendations

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Correlation clustering in general weighted graphs

Theoretical Computer Science - Approximation and online algorithms
A comparison of extrinsic clustering evaluation metrics based on formal constraints

Information Retrieval
Regression-based latent factor models

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Distance Metric Learning for Large Margin Nearest Neighbor Classification

The Journal of Machine Learning Research
How unique is your web browser?

PETS'10 Proceedings of the 10th international conference on Privacy enhancing technologies
Generalizing matrix factorization through flexible regression priors

Proceedings of the fifth ACM conference on Recommender systems

From devices to people: attribution of search activity in multi-user settings

Proceedings of the 23rd international conference on World wide web

Quantified Score

Hi-index	0.00

Visualization

Abstract

Many large Internet websites are accessed by users anonymously, without requiring registration or logging-in. However, to provide personalized service these sites build anonymous, yet persistent, user models based on repeated user visits. Cookies, issued when a web browser first visits a site, are typically employed to anonymously associate a website visit with a distinct user (web browser). However, users may reset cookies, making such association short-lived and noisy. In this paper we propose a solution to the cookie churn problem: a novel algorithm for grouping similar cookies into clusters that are more persistent than individual cookies. Such clustering could potentially allow more robust estimation of the number of unique visitors of the site over a certain long time period, and also better user modeling which is key to plenty of web applications such as advertising and recommender systems. We present a novel method to cluster browser cookies into groups that are likely to belong to the same browser based on a statistical model of browser visitation patterns. We address each step of the clustering as a binary classification problem estimating the probability that two different subsets of cookies belong to the same browser. We observe that our clustering problem is a generalized interval graph coloring problem, and propose a greedy heuristic algorithm for solving it. The scalability of this method allows us to cluster hundreds of millions of browser cookies and provides significant improvements over baselines such as constrained K-means.