Randomized algorithms
A constant-factor approximation algorithm for the k-median problem (extended abstract)
STOC '99 Proceedings of the thirty-first annual ACM symposium on Theory of computing
Sublinear time algorithms for metric space problems
STOC '99 Proceedings of the thirty-first annual ACM symposium on Theory of computing
Foundations of statistical natural language processing
Foundations of statistical natural language processing
Sublinear time approximate clustering
SODA '01 Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms
Learning mixtures of arbitrary gaussians
STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
A new greedy approach for facility location problems
STOC '02 Proceedings of the thiry-fourth annual ACM symposium on Theory of computing
SIAM Journal on Computing
Improved Combinatorial Algorithms for the Facility Location and k-Median Problems
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
Learning Mixtures of Gaussians
FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science
FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Pattern Classification (2nd Edition)
Pattern Classification (2nd Edition)
Approximation algorithms for np -hard clustering problems
Approximation algorithms for np -hard clustering problems
A fast k-means implementation using coresets
Proceedings of the twenty-second annual symposium on Computational geometry
Approximation algorithms for hierarchical location problems
Journal of Computer and System Sciences - Special issue on network algorithms 2005
A PTAS for k-means clustering based on weak coresets
SCG '07 Proceedings of the twenty-third annual symposium on Computational geometry
Smooth sensitivity and sampling in private data analysis
Proceedings of the thirty-ninth annual ACM symposium on Theory of computing
Clustering for metric and non-metric distance measures
Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms
Adaptive Sampling for k-Means Clustering
APPROX '09 / RANDOM '09 Proceedings of the 12th International Workshop and 13th International Workshop on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques
A sublinear-time approximation scheme for bin packing
Theoretical Computer Science
Small space representations for metric min-sum k-clustering and their applications
STACS'07 Proceedings of the 24th annual conference on Theoretical aspects of computer science
Scalable Clustering for Mining Local-Correlated Clusters in High Dimensions and Large Datasets
Fundamenta Informaticae - Intelligent Data Analysis in Granular Computing
Clustering for metric and nonmetric distance measures
ACM Transactions on Algorithms (TALG)
Approximation algorithms for k-modes clustering
ICIC'06 Proceedings of the 2006 international conference on Intelligent computing: Part II
Property testing
Property testing
Speeding-Up hierarchical agglomerative clustering in presence of expensive metrics
PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Parallel probabilistic tree embeddings, k-median, and buy-at-bulk network design
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
The effectiveness of lloyd-type methods for the k-means problem
Journal of the ACM (JACM)
Deterministic sublinear-time approximations for metric 1-median selection
Information Processing Letters
Hi-index | 0.00 |
Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call successive sampling that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log \frac{n}{k})) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say \frac{1}{100}) probability. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.