Stability Yields a PTAS for k-Median and k-Means Clustering

  • Authors:
  • Pranjal Awasthi; Avrim Blum; Or Sheffet


  • Venue:
  • FOCS '10 Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science
  • Year:
  • 2010


Abstract

We consider $k$-median clustering in finite metric spaces and $k$-means clustering in Euclidean spaces, in the setting where $k$ is part of the input (not a constant). For the $k$-means problem, Ostrovsky et al. show that if the optimal $(k-1)$-means clustering of the input is more expensive than the optimal $k$-means clustering by a factor of $1/\epsilon^2$, then one can achieve a $(1+f(\epsilon))$-approximation to the $k$-means optimal in time polynomial in $n$ and $k$ by using a variant of Lloyd's algorithm. In this work we substantially improve this approximation guarantee. We show that given only the condition that the $(k-1)$-means optimal is more expensive than the $k$-means optimal by a factor $1+\alpha$ for {\em some} constant $\alpha > 0$, we can obtain a PTAS. In particular, under this assumption, for any $\epsilon > 0$ we achieve a $(1+\epsilon)$-approximation to the $k$-means optimal in time polynomial in $n$ and $k$, and exponential in $1/\epsilon$ and $1/\alpha$. We thus decouple the strength of the assumption from the quality of the approximation ratio. We also give a PTAS for the $k$-median problem in finite metrics under the analogous assumption. For $k$-means, we additionally give a randomized algorithm with improved running time of $n^{O(1)}(k \log n)^{\mathrm{poly}(1/\epsilon,1/\alpha)}$. Our technique also obtains a PTAS under the assumption of Balcan et al. that all $(1+\alpha)$-approximations are $\delta$-close to a desired target clustering, in the case that all target clusters have size greater than $\delta n$ and $\alpha > 0$ is constant. Note that the motivation of Balcan et al. is that for many clustering problems, the objective function is only a proxy for the true goal of getting close to the target.
From this perspective, our improvement is that for $k$-means in Euclidean spaces we reduce the distance of the clustering found to the target from $O(\delta)$ to $\delta$ when all target clusters are large, and for $k$-median we improve the ``largeness'' condition needed in the work of Balcan et al. to get exactly $\delta$-close from $O(\delta n)$ to $\delta n$. Our results are based on a new notion of clustering stability.
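To make the stability assumption concrete: an instance is $(1+\alpha)$-stable for $k$-means when the optimal $(k-1)$-means cost exceeds the optimal $k$-means cost by a factor of at least $1+\alpha$. The sketch below is not the paper's algorithm; it is a brute-force illustration (feasible only for tiny instances) that computes the exact optimal $k$-means and $(k-1)$-means costs by enumerating all partitions, so one can check the separation condition directly. All function names here are hypothetical.

```python
def kmeans_cost(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for pts in clusters:
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in pts)
    return total

def partitions(points, k):
    """Yield every partition of `points` into exactly k nonempty parts."""
    if k == 1:
        yield [list(points)]
        return
    if len(points) < k:
        return
    first, rest = points[0], points[1:]
    # Case 1: `first` forms its own part.
    for p in partitions(rest, k - 1):
        yield [[first]] + p
    # Case 2: `first` joins one of the k parts of a partition of the rest.
    for p in partitions(rest, k):
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]

def opt_cost(points, k):
    """Exact optimal k-means cost by exhaustive search (exponential in n)."""
    return min(kmeans_cost(p) for p in partitions(points, k))

def is_stable(points, k, alpha):
    """Check the (1+alpha) separation between the (k-1)-means and k-means optima."""
    return opt_cost(points, k - 1) >= (1 + alpha) * opt_cost(points, k)

# Two well-separated pairs in the plane: the 2-means optimum (cost 1.0) is far
# cheaper than the 1-means optimum (cost 101.0), so the instance is highly stable.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(opt_cost(pts, 2), opt_cost(pts, 1), is_stable(pts, 2, alpha=1.0))
```

On such a stable instance, the paper's guarantee says a $(1+\epsilon)$-approximation is achievable in time polynomial in $n$ and $k$; the exhaustive search above serves only to verify the hypothesis on toy data.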