The k-means clustering algorithm has a long history and proven practical performance; however, it does not scale to clustering millions of data points into thousands of clusters in high-dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, a prohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real-life data sets the wall-clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality. The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low, so they can be assigned separately once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low-weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the intermediate results of nearest-neighbor computations to improve performance.
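Below is a minimal sketch of the inverted assignment step described above, assuming sparse points represented as {feature: weight} dicts and dot-product scoring (cosine similarity for unit-normalized vectors). The names (build_inverted_index, sparsify, assign_by_ranked_retrieval) and the top_t parameter are illustrative, not the paper's implementation, and the sketch scores the postings exhaustively rather than with the WAND-based pruning the paper proposes.

```python
from collections import defaultdict

def build_inverted_index(points):
    """Map each feature to the list of (point_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for pid, vec in enumerate(points):
        for feature, weight in vec.items():
            index[feature].append((pid, weight))
    return index

def sparsify(centroid, top_t):
    """Prune a centroid to its top_t highest-weight features before querying."""
    top = sorted(centroid.items(), key=lambda kv: -abs(kv[1]))[:top_t]
    return dict(top)

def assign_by_ranked_retrieval(points, centroids, top_t=50):
    """One 'centroids picking points' pass: each sparsified centroid queries
    the inverted index over the points, and every scored point keeps its
    best-scoring centroid. Points never reached by any centroid's query
    remain unassigned, to be handled in a separate final pass."""
    index = build_inverted_index(points)
    best = {}  # point_id -> (score, centroid_id)
    for cid, centroid in enumerate(centroids):
        scores = defaultdict(float)
        for feature, cw in sparsify(centroid, top_t).items():
            for pid, pw in index.get(feature, ()):
                scores[pid] += cw * pw  # accumulate the partial dot product
        for pid, score in scores.items():
            if pid not in best or score > best[pid][0]:
                best[pid] = (score, cid)
    assigned = {pid: cid for pid, (_, cid) in best.items()}
    unassigned = [pid for pid in range(len(points)) if pid not in assigned]
    return assigned, unassigned

if __name__ == "__main__":
    # Toy example: the last point shares no features with any centroid's
    # pruned query, so it comes back unassigned.
    points = [{"a": 1.0, "b": 0.5}, {"b": 1.0}, {"c": 1.0}, {"d": 1.0}]
    centroids = [{"a": 0.9, "b": 0.6}, {"c": 1.0}]
    print(assign_by_ranked_retrieval(points, centroids, top_t=2))
```

In this sketch the cost per iteration is driven by the postings touched by each centroid's top_t features rather than by a full pass over all point-centroid pairs, which is the source of the speedups reported in the abstract; the WAND variant would additionally skip low-scoring postings during the traversal.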