The k-means clustering algorithm has a long history and proven practical performance; however, it does not scale to clustering millions of data points into thousands of clusters in high-dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, a prohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real-life data sets the wall-clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality. The key insight is to invert the process of point-to-centroid assignment by creating an inverted index over all the points and then using the current centroids as queries to this index to decide on cluster membership. In other words, rather than each iteration consisting of "points picking centroids", each iteration now consists of "centroids picking points". This is much more efficient, but comes at the cost of leaving some points unassigned to any centroid. We show experimentally that the number of such points is low, so they can be assigned separately once the final centroids are decided. To speed up the computation we sparsify the centroids by pruning low-weight features. Finally, to further reduce the running time and the number of unassigned points, we propose a variant of the WAND algorithm that uses the intermediate results of nearest-neighbor computations to improve performance.
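Below is a minimal sketch of the inverted assignment step described above, assuming sparse points represented as {feature: weight} dicts and dot-product scoring (cosine similarity for unit-normalized vectors). The names (build_inverted_index, sparsify, assign_by_ranked_retrieval) and the top_t parameter are illustrative, not the paper's implementation, and the sketch scores the postings exhaustively rather than with the WAND-based pruning the paper proposes.

```python
from collections import defaultdict

def build_inverted_index(points):
    """Map each feature to the list of (point_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for pid, vec in enumerate(points):
        for feature, weight in vec.items():
            index[feature].append((pid, weight))
    return index

def sparsify(centroid, top_t):
    """Prune a centroid to its top_t highest-weight features before querying."""
    top = sorted(centroid.items(), key=lambda kv: -abs(kv[1]))[:top_t]
    return dict(top)

def assign_by_ranked_retrieval(points, centroids, top_t=50):
    """One 'centroids picking points' pass: each sparsified centroid queries
    the inverted index over the points, and every scored point keeps its
    best-scoring centroid. Points never reached by any centroid's query
    remain unassigned, to be handled in a separate final pass."""
    index = build_inverted_index(points)
    best = {}  # point_id -> (score, centroid_id)
    for cid, centroid in enumerate(centroids):
        scores = defaultdict(float)
        for feature, cw in sparsify(centroid, top_t).items():
            for pid, pw in index.get(feature, ()):
                scores[pid] += cw * pw  # accumulate the partial dot product
        for pid, score in scores.items():
            if pid not in best or score > best[pid][0]:
                best[pid] = (score, cid)
    assigned = {pid: cid for pid, (_, cid) in best.items()}
    unassigned = [pid for pid in range(len(points)) if pid not in assigned]
    return assigned, unassigned

if __name__ == "__main__":
    # Toy example: the last point shares no features with any centroid's
    # pruned query, so it comes back unassigned.
    points = [{"a": 1.0, "b": 0.5}, {"b": 1.0}, {"c": 1.0}, {"d": 1.0}]
    centroids = [{"a": 0.9, "b": 0.6}, {"c": 1.0}]
    print(assign_by_ranked_retrieval(points, centroids, top_t=2))
```

In this sketch the cost per iteration is driven by the postings touched by each centroid's top_t features rather than by a full pass over all point-centroid pairs, which is the source of the speedups reported in the abstract; the WAND variant would additionally skip low-scoring postings during the traversal.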