K-means: algorithms, analyses, experiments

  • Authors:
  • Rajeev Motwani; Sergei Vassilvitskii

  • Affiliations:
  • Stanford University; Stanford University

  • Venue:
  • K-means: algorithms, analyses, experiments
  • Year:
  • 2007

Abstract

A half century after its initial introduction, the k-means method remains a widely used clustering technique. Although it offers no performance guarantees on the quality of the clustering, its simplicity and observed speed are appealing to practitioners. It is employed to cluster all kinds of data, from text documents to gene sequences. Despite the popularity of k-means, little was known about its worst case performance, both in terms of the running time of the algorithm and the accuracy of the clustering. We begin by showing that the worst-case convergence time of k-means may be exponential. We also provide a theoretical explanation for why this behavior is not readily observed in practice, by considering the worst-case running time of k-means after randomly perturbing each data point. Known as smoothed complexity, this approach models the noise often present in data-driven scenarios. We prove that the smoothed complexity of k-means is polynomial in the number of points. As with all local search methods, the initial solution point of k-means plays a large role in the quality of the final result. The usual randomized initialization procedure for k-means often performs poorly, and many other approaches have been proposed. The vast majority of these approaches offer no worst-case guarantees, and the selection of the "right" routine remains a black art. Moreover, the algorithms that do offer analytical guarantees have unattractive running times. We introduce k-means++: a simple, linear-time randomized initialization algorithm and prove tight analytical bounds on its performance. We implement the k-means++ algorithm, and evaluate the quality of the clustering produced on a variety of different datasets. We then compare the performance of k-means++ to other popular initialization functions. Our experiments show that k-means++ improves both the speed and the accuracy of k-means, often quite dramatically.
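Below is a minimal sketch, in Python with NumPy, of the seeding idea the abstract describes: k-means++ picks the first center uniformly at random and each subsequent center with probability proportional to its squared distance from the nearest center already chosen, then hands the result to the usual Lloyd iterations. The function names (`kmeanspp_init`, `lloyd`), the library choice, and the loop structure are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np


def kmeanspp_init(points, k, rng):
    """Pick k initial centers: the first uniformly at random, each later
    center with probability proportional to its squared distance to the
    nearest center chosen so far (the D^2 weighting)."""
    n = points.shape[0]
    centers = [points[rng.integers(n)]]
    # Squared distance from every point to its closest chosen center.
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(points[idx])
        # Update each point's distance to its nearest chosen center.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers)


def lloyd(points, centers, iters=20):
    """Standard k-means (Lloyd) iterations: assign, then recompute means."""
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        for j in range(centers.shape[0]):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(500, 2))        # toy data for illustration only
    init = kmeanspp_init(data, k=5, rng=rng)
    final_centers, assignment = lloyd(data, init.copy())
    print(final_centers)
```

The seeding pass touches each point once per chosen center, which is what makes the initialization linear-time in the number of points for fixed k, as the abstract notes.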