Scalable k-means++

Authors:
Bahman Bahmani;Benjamin Moseley;Andrea Vattani;Ravi Kumar;Sergei Vassilvitskii
Affiliations:
Stanford University, Stanford, CA;University of Illinois, Urbana, IL;University of California, San Diego, CA;Yahoo! Research, Sunnyvale, CA;Yahoo! Research, New York, NY
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work, we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts to parallelize k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
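
The contrast the abstract draws between k passes and a small number of passes can be illustrated with a short single-machine sketch. It is not the authors' implementation. The oversampling factor (`oversample`), the fixed round count (`rounds`), and the final reclustering of weighted candidates with k-means++ follow the general scheme described in the full paper, not anything stated in the abstract, and all function and parameter names here are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_sq_dist(p, centers):
    """Squared distance from p to its closest center."""
    return min(sq_dist(p, c) for c in centers)

def kmeanspp_seed(points, weights, k, rng):
    """Sequential k-means++ seeding on weighted points.

    Each new center needs a full pass over the points, so k centers
    cost k passes. May return fewer than k centers if the points with
    positive weight contain fewer than k distinct locations.
    """
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    while len(centers) < k:
        costs = [w * nearest_sq_dist(p, centers)
                 for p, w in zip(points, weights)]
        if sum(costs) == 0:
            break
        centers.append(rng.choices(points, weights=costs, k=1)[0])
    return centers

def kmeans_parallel_seed(points, k, oversample, rounds, rng):
    """Sketch of k-means||-style seeding (assumed parameterization).

    Each round is one pass in which every point is sampled
    independently, so the pass could be split across workers.
    """
    centers = [rng.choice(points)]
    for _ in range(rounds):
        d2 = [nearest_sq_dist(p, centers) for p in points]
        total = sum(d2)
        if total == 0:
            break
        # Sample each point with probability proportional to its cost.
        centers.extend(
            p for p, d in zip(points, d2)
            if rng.random() < min(1.0, oversample * d / total)
        )
    # Weight each candidate by the number of points closest to it.
    weights = [0] * len(centers)
    for p in points:
        j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        weights[j] += 1
    # Reduce the small weighted candidate set to k seeds sequentially.
    return kmeanspp_seed(centers, weights, k, rng)

if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in [(0, 0), (20, 0), (0, 20)] for _ in range(200)]
    print(kmeans_parallel_seed(data, k=3, oversample=6, rounds=5, rng=rng))
```

In this sketch the passes over the full data set number `rounds` plus one for the weighting step, independent of k. The sequential k-means++ step runs only on the candidate set, which is expected to be far smaller than the input.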