Clustering for metric and nonmetric distance measures

  • Authors:
  • Marcel R. Ackermann, Johannes Blömer, Christian Sohler

  • Affiliations:
  • University of Paderborn, Paderborn, Germany (Ackermann, Blömer); Technische Universität Dortmund, Dortmund, Germany (Sohler)

  • Venue:
  • ACM Transactions on Algorithms (TALG)
  • Year:
  • 2010

Abstract

We study a generalization of the k-median problem with respect to an arbitrary dissimilarity measure D. Given a finite set P of n points, our goal is to find a set C of k centers such that the sum of errors D(P,C) = ∑_{p ∈ P} min_{c ∈ C} D(p,c) is minimized. The main result of this article can be stated as follows: there exists a (1+ε)-approximation algorithm for the k-median problem with respect to D if the 1-median problem can be approximated within a factor of (1+ε) by taking a random sample of constant size and solving the 1-median problem on the sample exactly. This algorithm runs in time n · 2^{O(mk log(mk/ε))}, where m is a constant that depends only on ε and D. Using this characterization, we obtain the first linear-time (1+ε)-approximation algorithms for the k-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence (relative entropy), for the Itakura-Saito divergence, for Mahalanobis distances, and for some special cases of Bregman divergences. Moreover, we recover previously known results for the Euclidean k-median and Euclidean k-means problems in a simplified manner. Our results are based on a new analysis of an algorithm of Kumar et al. [2004].
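
To make the objective concrete, the following Python sketch (all names are hypothetical; the article itself gives no code) evaluates the sum-of-errors cost D(P,C) for an arbitrary dissimilarity measure, using the Kullback-Leibler divergence as one of the measures covered by the result, and illustrates the kind of constant-size-sample 1-median subroutine the characterization requires. For simplicity the sketch picks the best center among the sampled points rather than solving the 1-median problem on the sample exactly, so it illustrates the idea, not the paper's algorithm.

    import math
    import random

    def kl_divergence(p, q):
        """Kullback-Leibler divergence D(p || q) between two discrete
        probability distributions with strictly positive entries."""
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

    def clustering_cost(P, C, D):
        """Sum-of-errors objective D(P,C) = sum_{p in P} min_{c in C} D(p,c)."""
        return sum(min(D(p, c) for c in C) for p in P)

    def sampled_1median(P, D, sample_size=20, trials=10):
        """Sampling-based 1-median heuristic: draw a constant-size random
        sample and return the sampled point of lowest cost on all of P.
        (The paper's condition asks for an exact 1-median solution on the
        sample; restricting candidates to the sampled points is a
        simplification made here for illustration.)"""
        best_center, best_cost = None, float("inf")
        for _ in range(trials):
            sample = random.sample(P, min(sample_size, len(P)))
            for c in sample:
                cost = clustering_cost(P, [c], D)
                if cost < best_cost:
                    best_center, best_cost = c, cost
        return best_center

In the framework analyzed in the article, a subroutine of this flavor is invoked repeatedly within the recursive algorithm of Kumar et al. [2004], finding an approximate center for one (large) cluster at a time; the constant sample size is what keeps the overall running time linear in n.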