A k-Median Algorithm with Running Time Independent of Data Size

  • Authors:
  • Adam Meyerson;Liadan O'Callaghan;Serge Plotkin

  • Affiliations:
  • Department of Computer Science, University of California, Los Angeles. awm@cs.ucla.edu;Department of Computer Science, Stanford University. loc@cs.stanford.edu;Department of Computer Science, Stanford University. plotkin@theory.stanford.edu

  • Venue:
  • Machine Learning
  • Year:
  • 2004

Abstract

We give a sampling-based algorithm for the k-Median problem, with running time $O\left(k \left(\frac{k^2}{\epsilon} \log k\right)^2 \log\left(\frac{k}{\epsilon} \log k\right)\right)$, where k is the desired number of clusters and ε is a confidence parameter. This is the first k-Median algorithm with fully polynomial running time that is independent of n, the size of the data set. It gives a solution that is, with high probability, an O(1)-approximation, if each cluster in some optimal solution has $\Omega\left(\frac{n\epsilon}{k}\right)$ points. We also give weakly-polynomial-time algorithms for this problem and a relaxed version of k-Median in which a small fraction of outliers can be excluded. We give near-matching lower bounds showing that this assumption about cluster size is necessary. We also present a related algorithm for finding a clustering that excludes a small number of outliers.
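The general idea of clustering a sample rather than the full data set can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the single-swap local search is a stand-in chosen for brevity, and the names `sample_then_cluster`, `kmedian_cost`, and the parameter `sample_size` are hypothetical. In the setting of the abstract, the sample size would depend only on k and ε; the exact value is not stated here, so the sketch takes it as an input.

```python
import random


def kmedian_cost(points, centers, dist):
    """Sum over points of the distance to the nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def sample_then_cluster(data, k, sample_size, dist, seed=0):
    """Pick k centers using only a random sample of the data.

    Assumption: sample_size is supplied by the caller and does not
    grow with len(data), so the clustering work is independent of n.
    """
    if sample_size < k:
        raise ValueError("sample_size must be at least k")
    rng = random.Random(seed)
    # Sample with replacement; only sample_size items are ever read.
    sample = [data[rng.randrange(len(data))] for _ in range(sample_size)]

    # Stand-in solver: single-swap local search restricted to the sample.
    centers = list(sample[:k])
    best = kmedian_cost(sample, centers, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sample:
                trial = centers[:i] + [cand] + centers[i + 1:]
                cost = kmedian_cost(sample, trial, dist)
                if cost < best:  # accept only strict improvements
                    centers, best = trial, cost
                    improved = True
    return centers
```

Calling `sample_then_cluster(data, 3, 60, lambda a, b: abs(a - b))` on one-dimensional data, for instance, evaluates distances only among the 60 sampled points, whatever the length of `data`. A cluster much smaller than n/k may receive no sample points at all, which gives some intuition for why the abstract's guarantee assumes a minimum cluster size.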