Many large websites are accessed anonymously, without requiring users to register or log in. To provide personalized service, these sites nonetheless build anonymous yet persistent user models from repeated visits. Cookies, issued when a web browser first visits a site, are typically used to associate each visit anonymously with a distinct user (web browser). However, users may reset cookies, which makes this association short-lived and noisy. In this paper we propose a solution to the cookie churn problem: a novel algorithm that groups similar cookies into clusters that are more persistent than individual cookies. Such clustering could allow more robust estimation of the number of unique visitors to a site over a long time period, as well as better user modeling, which is key to many web applications such as advertising and recommender systems. Our method clusters browser cookies into groups that are likely to belong to the same browser, based on a statistical model of browser visitation patterns. We treat each step of the clustering as a binary classification problem: estimating the probability that two different subsets of cookies belong to the same browser. We observe that our clustering problem is a generalized interval graph coloring problem and propose a greedy heuristic algorithm for solving it. The method scales to hundreds of millions of browser cookies and provides significant improvements over baselines such as constrained K-means.
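The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes each cookie has a lifetime interval (first and last time seen), that two cookies with overlapping lifetimes cannot come from the same browser (the interval graph coloring constraint), and that a scoring function estimates the probability that a cookie belongs with an existing cluster. The `Cookie` fields, the `same_browser_score` heuristic, and the `threshold` value are all hypothetical stand-ins for illustration; in particular, the score here is a toy rule, not the statistical model or classifier described in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Cookie:
    # All fields are illustrative assumptions, not the paper's feature set.
    cookie_id: str
    first_seen: int   # time of first visit (arbitrary integer units)
    last_seen: int    # time of last visit
    user_agent: str


def same_browser_score(cluster: List[Cookie], cookie: Cookie) -> float:
    """Toy stand-in for a learned binary classifier.

    Returns a pseudo-probability that `cookie` continues `cluster`:
    zero if the user agents differ, otherwise decaying with the time gap.
    """
    last = cluster[-1]
    if last.user_agent != cookie.user_agent:
        return 0.0
    gap = cookie.first_seen - last.last_seen
    return 1.0 / (1.0 + gap)


def greedy_cluster(
    cookies: List[Cookie],
    score: Callable[[List[Cookie], Cookie], float] = same_browser_score,
    threshold: float = 0.1,
) -> List[List[Cookie]]:
    """Greedily assign cookies, in order of first appearance, to clusters."""
    clusters: List[List[Cookie]] = []
    for cookie in sorted(cookies, key=lambda c: c.first_seen):
        best, best_score = None, threshold
        for cluster in clusters:
            # Clusters are built in time order without overlaps, so the
            # last member always has the latest end time in the cluster.
            if cluster[-1].last_seen >= cookie.first_seen:
                continue  # lifetimes overlap: cannot be the same browser
            s = score(cluster, cookie)
            if s > best_score:
                best, best_score = cluster, s
        if best is None:
            clusters.append([cookie])  # no compatible cluster: start a new one
        else:
            best.append(cookie)
    return clusters


cookies = [
    Cookie("a", 0, 5, "X"),
    Cookie("b", 3, 8, "X"),    # overlaps "a", so it must be a different browser
    Cookie("c", 6, 9, "X"),    # starts right after "a" ends
    Cookie("d", 10, 12, "Y"),  # different user agent
]
result = [[c.cookie_id for c in cluster] for cluster in greedy_cluster(cookies)]
print(result)
```

This sketch compares each cookie against every open cluster, which is quadratic in the worst case; handling hundreds of millions of cookies, as the abstract reports, would need a more careful candidate-selection strategy that is not described here.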