On Coresets for $k$-Median and $k$-Means Clustering in Metric and Euclidean Spaces and Their Applications

Authors:
Ke Chen
Affiliations:
kechen@engineering.uiuc.edu
Venue:
SIAM Journal on Computing
Year:
2009



Abstract

We present new approximation algorithms for the $k$-median and $k$-means clustering problems. To this end, we obtain small coresets for $k$-median and $k$-means clustering in general metric spaces and in Euclidean spaces. In $\mathbb{R}^d$, the size of these coresets depends polynomially on the dimension $d$. This leads to $(1+\varepsilon)$-approximation algorithms for $k$-median and $k$-means clustering in $\mathbb{R}^d$, with running time $O(ndk+2^{(k/\varepsilon)^{O(1)}}d^2\log^{k+2}n)$, where $n$ is the number of points. This improves over previous results. We also use these coresets to maintain a $(1+\varepsilon)$-approximate $k$-median and $k$-means clustering of a stream of points in $\mathbb{R}^d$, using $O(d^2k^2\varepsilon^{-2}\log^8n)$ space. These are the first streaming algorithms for these problems whose space complexity depends polynomially on the dimension.
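
As background for the term "coreset" (the usual definition, not quoted from the abstract): a weighted point set $S$ is a $(k,\varepsilon)$-coreset of $P$ if, for every set $C$ of $k$ centers, the weighted clustering cost of $S$ with respect to $C$ is within a factor $(1\pm\varepsilon)$ of the cost of $P$. The Python sketch below only illustrates that definition. All function names are hypothetical, and `merge_duplicates` is a trivial zero-error example, not the sampling-based construction of the paper.

```python
import math
from collections import Counter


def clustering_cost(points, weights, centers, power=1):
    """Weighted clustering cost: power=1 gives k-median, power=2 gives k-means."""
    total = 0.0
    for p, w in zip(points, weights):
        # Each point is charged its distance to the nearest center.
        nearest = min(math.dist(p, c) for c in centers)
        total += w * nearest ** power
    return total


def relative_error(full_points, core_points, core_weights, centers, power=1):
    """Relative gap between the coreset cost and the full cost for one center set."""
    full = clustering_cost(full_points, [1.0] * len(full_points), centers, power)
    core = clustering_cost(core_points, core_weights, centers, power)
    return abs(core - full) / full if full > 0 else abs(core - full)


def merge_duplicates(points):
    """Illustrative assumption: collapse repeated points into weighted ones.

    This preserves the cost exactly for every center set, so it satisfies
    the coreset definition with error 0, but it only shrinks inputs that
    contain duplicates.
    """
    counts = Counter(points)
    core_points = list(counts)
    core_weights = [float(counts[p]) for p in core_points]
    return core_points, core_weights


P = [(0.0, 0.0)] * 3 + [(1.0, 0.0)] * 2 + [(5.0, 5.0)]
S, w = merge_duplicates(P)
centers = [(0.0, 0.0), (5.0, 4.0)]
print(len(S), relative_error(P, S, w, centers))
```

A real coreset construction would replace `merge_duplicates` with a procedure that keeps far fewer points than the input while guaranteeing that `relative_error` stays below $\varepsilon$ for every choice of `centers`, not just the one checked here.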