Clustering with or without the approximation

Authors:
Frans Schalekamp;Michael Yu;Anke van Zuylen
Affiliations:
ITCS, Tsinghua University;MIT;ITCS, Tsinghua University
Venue:
COCOON'10 Proceedings of the 16th annual international conference on Computing and combinatorics
Year:
2010

Citing 9
Cited 4

Approximation algorithms for metric facility location and k-Median problems using the primal-dual schema and Lagrangian relaxation

Journal of the ACM (JACM)
Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP

Journal of the ACM (JACM)
Local Search Heuristics for k-Median and Facility Location Problems

SIAM Journal on Computing
The Effectiveness of Lloyd-Type Methods for the k-Means Problem

FOCS '06 Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
k-means++: the advantages of careful seeding

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
A discriminative framework for clustering via similarity functions

STOC '08 Proceedings of the fortieth annual ACM symposium on Theory of computing
Approximate clustering without the approximation

SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Agnostic clustering

ALT'09 Proceedings of the 20th international conference on Algorithmic learning theory
Clustering with or without the approximation

COCOON'10 Proceedings of the 16th annual international conference on Computing and combinatorics

Active clustering of biological sequences

The Journal of Machine Learning Research
Data stability in clustering: a closer look

ALT'12 Proceedings of the 23rd international conference on Algorithmic Learning Theory
Clustering under approximation stability

Journal of the ACM (JACM)

Abstract

We study algorithms for clustering data that were recently proposed by Balcan, Blum and Gupta in SODA'09 [4] and that have already given rise to two follow-up papers. The input for the clustering problem consists of points in a metric space and a number k, specifying the desired number of clusters. The algorithms find a clustering that is provably close to a target clustering, provided that the instance has the "(1+α, ε)-property": every solution to the k-median problem whose objective value is at most (1+α) times the optimal objective value corresponds to a clustering that misclassifies at most an ε fraction of the points with respect to the target clustering. We investigate the theoretical and practical implications of their results. Our main contributions are as follows. First, we show that instances that have the (1+α, ε)-property and for which, additionally, the clusters in the target clustering are large, are easier than general instances: the algorithm proposed in [4] is a constant-factor approximation algorithm with an approximation guarantee that is better than the known hardness of approximation for general instances. Further, we show that it is NP-hard to check whether an instance satisfies the (1+α, ε)-property for a given (α, ε); the algorithms in [4], however, need such α and ε as input parameters. We propose ways to use their algorithms even if we do not know values of α and ε for which the assumption holds. Finally, we implement these methods and other popular methods, and test them on real-world data sets. We find that on these data sets there are no α and ε such that the data set has both the (1+α, ε)-property and sufficiently large clusters in the target solution. For the general case, we show that on our data sets the performance guarantee proved by [4] is meaningless for the values of α, ε such that the data set has the (1+α, ε)-property. The algorithm nonetheless gives reasonable results, although it is outperformed by other methods.
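
The two quantities that the (1+α, ε)-property in the abstract compares can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the distance-matrix input, and the brute-force search over cluster relabelings are assumptions made for illustration, and the sketch is only practical for very small k.

```python
from itertools import permutations

def k_median_cost(dist, centers):
    """k-median objective: sum over all points of the distance to the
    nearest chosen center. `dist[i][j]` is the distance between points
    i and j; `centers` is a list of point indices (assumed input format)."""
    return sum(min(row[c] for c in centers) for row in dist)

def misclassified_fraction(labels, target, k):
    """Fraction of points on which `labels` disagrees with `target`,
    minimized over all relabelings of the k clusters (brute force,
    so only feasible for small k)."""
    n = len(labels)
    best = n
    for perm in permutations(range(k)):
        errors = sum(1 for a, b in zip(labels, target) if perm[a] != b)
        best = min(best, errors)
    return best / n

# Hypothetical toy instance: four points on a line at 0, 1, 10, 11.
coords = [0, 1, 10, 11]
dist = [[abs(p - q) for q in coords] for p in coords]
```

With these helpers, an instance has the property for a given (α, ε) if every k-clustering whose cost is at most (1+α) times the optimal cost has a misclassified fraction of at most ε. Checking that condition exhaustively is exactly what the abstract states is NP-hard in general.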