Optimal Time Bounds for Approximate Clustering

Authors:
Ramgopal R. Mettu;C. Greg Plaxton
Affiliations:
Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA. ramgopal@cs.dartmouth.edu;Department of Computer Science, University of Texas at Austin, Austin, TX 78712, USA. plaxton@cs.utexas.edu
Venue:
Machine Learning
Year:
2004

Abstract

Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call successive sampling that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say 1/100) probability. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
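The two ideas in the abstract, the k-median objective and a sample-then-discard summary of the input, can be illustrated with a minimal Python sketch. This is not the authors' algorithm as specified in the paper: the function names, the parameters `alpha` and `beta`, the use of Euclidean distance, and the rule "discard the fraction of points closest to the current sample" are all assumptions chosen for illustration. Under those assumptions, each round adds about `alpha * k` points to the summary and shrinks the remaining set by a constant factor, which is one way a summary of size roughly O(k log(n/k)) could arise.

```python
import math
import random


def k_median_cost(points, centers):
    """Average distance from each point to its nearest center.

    Evaluating this takes one distance computation per (point, center)
    pair, i.e. n * k distance computations.
    """
    total = sum(min(math.dist(p, c) for c in centers) for p in points)
    return total / len(points)


def successive_sample_sketch(points, k, alpha=2, beta=0.5, seed=0):
    """Illustrative sample-and-discard loop (hypothetical parameters).

    alpha: sample size multiplier (assumption, not from the abstract).
    beta:  fraction of remaining points discarded per round (assumption).
    Returns a small list of input points intended to summarize the input.
    """
    rng = random.Random(seed)
    sample_size = max(1, alpha * k)
    remaining = list(points)
    summary = []

    while len(remaining) > sample_size:
        # Draw a small uniform sample from the points still in play.
        sample = rng.sample(remaining, sample_size)
        summary.extend(sample)

        # Order the remaining points by distance to the nearest sampled point.
        # A linear-time selection would avoid the sort's extra log factor;
        # sorting is used here only to keep the sketch short.
        remaining.sort(key=lambda p: min(math.dist(p, s) for s in sample))

        # Discard the closest points; they are treated as represented by the
        # sample. Discarding at least sample_size points guarantees progress.
        discard = max(sample_size, int(beta * len(remaining)))
        remaining = remaining[discard:]

    # Whatever is left is small enough to keep in full.
    summary.extend(remaining)
    return summary


if __name__ == "__main__":
    pts = [(float(x), float(y)) for x in range(20) for y in range(10)]
    summary = successive_sample_sketch(pts, k=3)
    print(len(pts), "input points,", len(summary), "summary points")
    print("cost of using the summary as centers:", k_median_cost(pts, summary))
```

A k-median algorithm could then be run on the summary alone rather than on all n points. The approximation guarantee and the exact constants are established in the paper itself, not by this sketch.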