Small space representations for metric min-sum k-clustering and their applications

Authors:
Artur Czumaj;Christian Sohler
Affiliations:
Department of Computer Science, University of Warwick, Coventry, U.K.;Heinz Nixdorf Institute and Department of Computer Science, University of Paderborn, Paderborn, Germany
Venue:
STACS'07 Proceedings of the 24th annual conference on Theoretical aspects of computer science
Year:
2007

Abstract

The min-sum k-clustering problem is to partition a metric space $(P, d)$ into $k$ clusters $C_1, \ldots, C_k \subseteq P$ such that $\sum_{i=1}^{k} \sum_{p,q \in C_i} d(p,q)$ is minimized. We show the first efficient construction of a coreset for this problem. Our coreset construction is based on a new adaptive sampling algorithm. Using our coresets we obtain three main algorithmic results. The first result is a sublinear-time $(4+\varepsilon)$-approximation algorithm for the min-sum k-clustering problem in metric spaces. The running time of this algorithm is $\tilde{O}(n)$ for any constant $k$ and $\varepsilon$, and it is $o(n^2)$ for all $k = o(\log n / \log\log n)$. Since the description size of the input is $\Theta(n^2)$, this is sublinear in the input size. Our second result is the first pass-efficient data streaming algorithm for min-sum k-clustering in the distance oracle model, i.e., an algorithm that uses $\mathrm{poly}(\log n, k)$ space and makes 2 passes over the input point set arriving as a data stream. Our third result is a sublinear-time polylogarithmic-factor approximation algorithm for the min-sum k-clustering problem for arbitrary values of $k$. To develop the coresets, we introduce the concept of $\alpha$-preserving metric embeddings. Such an embedding satisfies two properties: (a) the distance between any pair of points does not decrease, and (b) the cost of an optimal solution for the considered problem on input $(P, d')$ is within a constant factor of the optimal solution on input $(P, d)$. In other words, the idea is to find a metric embedding into a (structurally simpler) metric space that approximates the original metric up to a factor of $\alpha$ with respect to a certain problem. We believe that this concept is an interesting generalization of coresets.
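
To make the objective in the abstract concrete, the following is a minimal Python sketch that evaluates the min-sum cost of a given partition and finds an optimal partition by exhaustive search on a tiny input. It is not the paper's coreset or sampling algorithm. The function names (`min_sum_cost`, `brute_force_min_sum`, `line_metric`) and the example points are hypothetical, and the sum is assumed to run over ordered pairs as the formula is written, so each unordered pair is counted twice.

```python
from itertools import product


def min_sum_cost(clusters, d):
    """Sum of d(p, q) over all ordered pairs p, q that share a cluster."""
    return sum(d(p, q) for cluster in clusters for p in cluster for q in cluster)


def brute_force_min_sum(points, k, d):
    """Exhaustive search over all k^n labelings; only feasible for tiny inputs."""
    best_cost, best_clusters = float("inf"), None
    for labels in product(range(k), repeat=len(points)):
        # Group the points by their assigned label (empty clusters are allowed).
        clusters = [
            [p for p, label in zip(points, labels) if label == i]
            for i in range(k)
        ]
        cost = min_sum_cost(clusters, d)
        if cost < best_cost:
            best_cost, best_clusters = cost, clusters
    return best_cost, best_clusters


def line_metric(p, q):
    """Illustrative metric: absolute difference of points on the real line."""
    return abs(p - q)


points = [0, 1, 2, 10, 11]
cost, clusters = brute_force_min_sum(points, 2, line_metric)
print(cost, clusters)
```

The exhaustive search takes time exponential in the number of points, which is why approximation algorithms and small-space representations such as the coresets described above are of interest.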