Optimal Time Bounds for Approximate Clustering

  • Authors:
  • Ramgopal R. Mettu;C. Greg Plaxton

  • Affiliations:
  • Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA. ramgopal@cs.dartmouth.edu;Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA. plaxton@cs.utexas.edu

  • Venue:
  • Machine Learning
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call successive sampling that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log \frac{n}{k})) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say \frac{1}{100}) probability. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.