On Coresets for $k$-Median and $k$-Means Clustering in Metric and Euclidean Spaces and Their Applications

  • Authors:
  • Ke Chen

  • Affiliations:
  • kechen@engineering.uiuc.edu

  • Venue:
  • SIAM Journal on Computing
  • Year:
  • 2009

Abstract

We present new approximation algorithms for the $k$-median and $k$-means clustering problems. To this end, we obtain small coresets for $k$-median and $k$-means clustering in general metric spaces and in Euclidean spaces. In $\mathbb{R}^d$, the size of these coresets depends polynomially on the dimension $d$. This leads to $(1+\varepsilon)$-approximation algorithms for optimal $k$-median and $k$-means clustering in $\mathbb{R}^d$, with running time $O(ndk+2^{(k/\varepsilon)^{O(1)}}d^2\log^{k+2}n)$, where $n$ is the number of points. This improves over previous results. We use these coresets to maintain a $(1+\varepsilon)$-approximate $k$-median and $k$-means clustering of a stream of points in $\mathbb{R}^d$, using $O(d^2k^2\varepsilon^{-2}\log^8n)$ space. These are the first streaming algorithms for these problems whose space complexity depends polynomially on the dimension.
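
For readers unfamiliar with the term, the following sketches the coreset guarantee as it is commonly stated in the clustering literature. The notation ($P$ for the input points, $S$ for the weighted coreset, $w$ for its weights, $C$ for a candidate set of centers, $\mathrm{dist}$ for the metric) is chosen here for illustration and is not taken from the abstract, so the paper's precise definition may differ in details. For $k$-median, a weighted set $S$ is a $(k,\varepsilon)$-coreset of $P$ if, for every set $C$ of $k$ centers,

$$
(1-\varepsilon)\sum_{p\in P}\mathrm{dist}(p,C)\;\le\;\sum_{s\in S}w(s)\,\mathrm{dist}(s,C)\;\le\;(1+\varepsilon)\sum_{p\in P}\mathrm{dist}(p,C),
\qquad \mathrm{dist}(p,C)=\min_{c\in C}\mathrm{dist}(p,c).
$$

For $k$-means the same condition is stated with $\mathrm{dist}(\cdot,C)^2$ in place of $\mathrm{dist}(\cdot,C)$. Under this guarantee, a clustering computed on the small set $S$ has nearly the same cost on the full set $P$, which is what allows an expensive algorithm to be run on $S$ instead of on all $n$ points.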