Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique, which we call successive sampling, that may be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution whose cost is at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a small probability (say, 1/100). The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context, our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just a single iteration and provides no useful approximation guarantees.
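The repeated sample-and-discard structure behind successive sampling can be conveyed in a short sketch. The code below is an illustrative approximation, not the paper's exact algorithm or its analysis: the helper names `kmedian_cost` and `successive_sample`, the per-round sample size of roughly k log n, and the discard fraction `beta` are our assumptions, chosen only to show how a uniform sample can absorb the "easy" (nearby) points each round, leaving a small summary set.

```python
import math
import random

def kmedian_cost(points, centers, dist):
    """k-median objective: sum of distances from each point to its nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)

def successive_sample(points, k, dist, beta=0.5):
    """Illustrative successive-sampling sketch (not the paper's exact procedure).

    Each round: draw a uniform sample of ~k log n points, discard the
    beta-fraction of remaining points closest to the sample, and repeat
    on the survivors. The union of the samples is the summary set.
    """
    remaining = list(points)
    summary = []
    # Assumed per-round sample size; the paper's constants differ.
    target = max(1, int(k * math.log(max(len(points), 2))))
    while len(remaining) > target:
        sample = random.sample(remaining, min(target, len(remaining)))
        summary.extend(sample)
        # Rank remaining points by distance to the nearest sampled point.
        ranked = sorted(remaining, key=lambda p: min(dist(p, s) for s in sample))
        # Drop the closest beta-fraction (they are "covered" by the sample).
        remaining = ranked[int(beta * len(ranked)):]
    summary.extend(remaining)
    return summary
```

Because a constant fraction of the points is discarded each round, the number of rounds is logarithmic, which is what keeps the summary set small relative to the input.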