Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
Vector quantization and signal compression
Accelerating exact k-means algorithms with geometric reasoning
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
ACM Computing Surveys (CSUR)
Local search heuristic for k-median and facility location problems
STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Approximate clustering via core-sets
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
A local search approximation algorithm for k-means clustering
Proceedings of the eighteenth annual symposium on Computational geometry
Acceleration of K-Means and Related Clustering Algorithms
ALENEX '02 Revised Papers from the 4th International Workshop on Algorithm Engineering and Experiments
Clustering Data Streams: Theory and Practice
IEEE Transactions on Knowledge and Data Engineering
Better streaming algorithms for clustering problems
Proceedings of the thirty-fifth annual ACM symposium on Theory of computing
FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Online facility location
FOCS '01 Proceedings of the 42nd IEEE symposium on Foundations of Computer Science
On coresets for k-means and k-median clustering
STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Coresets in dynamic geometric data streams
Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
How slow is the k-means method?
Proceedings of the twenty-second annual symposium on Computational geometry
The Effectiveness of Lloyd-Type Methods for the k-Means Problem
FOCS '06 Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
A PTAS for k-means clustering based on weak coresets
SCG '07 Proceedings of the twenty-third annual symposium on Computational geometry
k-means++: the advantages of careful seeding
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Data analysis in the social sciences: what about the details?
AFIPS '65 (Fall, part I) Proceedings of the November 30--December 1, 1965, fall joint computer conference, part I
Approximate clustering without the approximation
SODA '09 Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms
IEEE Transactions on Information Theory
Memoryless facility location in one pass
ACM Transactions on Algorithms (TALG)
Measuring the impact of sense similarity on word sense induction
EMNLP '11 Proceedings of the First Workshop on Unsupervised Learning in NLP
Deterministic sublinear-time approximations for metric 1-median selection
Information Processing Letters
Scalable K-Means by ranked retrieval
Proceedings of the 7th ACM international conference on Web search and data mining
One of the central problems in data analysis is k-means clustering. In recent years, the streaming variant of this problem has received considerable attention in the literature, culminating in a series of results (Har-Peled and Mazumdar; Frahling and Sohler; Frahling, Monemizadeh, and Sohler; Chen) that give a (1 + ε)-approximation for k-means clustering in the streaming setting. Unfortunately, since optimizing the k-means objective is Max-SNP-hard, any algorithm that achieves a (1 + ε)-approximation must take time exponential in k unless P = NP. Thus, to avoid exponential dependence on k, additional assumptions are needed to guarantee both a high-quality approximation and polynomial running time. A recent paper of Ostrovsky, Rabani, Schulman, and Swamy (FOCS 2006) introduced the very natural assumption of data separability: this assumption closely reflects how k-means is used in practice and allowed the authors to obtain a high-quality approximation for k-means clustering in the non-streaming setting with polynomial running time even for large values of k. Their work left open a natural and important question: are similar results possible in the streaming setting? This is the question we answer in this paper, albeit using substantially different techniques. We show a near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space that uses sublinear memory and a single pass, under the same data separability assumption. Our algorithm offers significant improvements in both space and running time over previous work while yielding asymptotically best-possible performance (assuming that the running time must be fully polynomial and P ≠ NP). The novel techniques we develop along the way imply a number of additional results: we provide a high-probability performance guarantee for online facility location (whereas Meyerson's FOCS 2001 algorithm gave bounds only in expectation); we develop a constant-factor approximation method for the general class of semi-metric clustering problems; we improve the space requirements for streaming constant-factor approximation of k-median by a logarithmic factor (even without σ-separability); and finally, we design a "re-sampling method" in the streaming setting that converts any constant-factor approximation for clustering into a [1 + O(σ²)]-approximation for σ-separable data.
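To make the building blocks mentioned above concrete, below is a minimal, illustrative Python sketch of the k-means objective and of the classic Meyerson-style online facility location rule referenced in the abstract: each arriving point either opens a new facility at its own location, with probability proportional to its distance to the nearest open facility divided by a facility cost f, or is assigned to that nearest facility. This is a sketch of the textbook rule under assumed parameters (the facility cost f, the helper names, and the toy data are illustrative), not the paper's actual streaming algorithm, which adds the machinery needed for high-probability guarantees and for the [1 + O(σ²)] bound.

    import math
    import random

    def squared_dist(p, q):
        # Squared Euclidean distance between two points given as tuples of floats.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def kmeans_cost(points, centers):
        # The k-means objective: sum over points of squared distance to the nearest center.
        return sum(min(squared_dist(p, c) for c in centers) for p in points)

    def online_facility_location(stream, f):
        # One pass of the Meyerson-style online facility location rule (metric version).
        # Each arriving point opens a new facility with probability min(dist / f, 1),
        # where dist is its distance to the nearest open facility; otherwise it is
        # assigned to that facility and pays dist. k-means-style adaptations typically
        # use squared distances instead.
        facilities = []
        cost = 0.0
        for p in stream:
            if not facilities:
                facilities.append(p)        # the first point always opens a facility
                cost += f
                continue
            dist = min(math.sqrt(squared_dist(p, c)) for c in facilities)
            if random.random() < min(dist / f, 1.0):
                facilities.append(p)        # open a new facility at p, paying f
                cost += f
            else:
                cost += dist                # serve p from its nearest facility
        return facilities, cost

    if __name__ == "__main__":
        random.seed(0)
        # Toy stream: three well-separated Gaussian blobs, shuffled into one pass.
        stream = [(random.gauss(mx, 0.1), random.gauss(my, 0.1))
                  for mx, my in [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
                  for _ in range(100)]
        random.shuffle(stream)
        centers, cost = online_facility_location(stream, f=1.0)
        print(len(centers), "facilities opened; total cost", round(cost, 2))

Under the ORSS-style notion of separability used in the abstract, a dataset is (roughly) σ-separable when the optimal k-means cost is at most σ² times the optimal (k − 1)-means cost; it is under this assumption that the paper's re-sampling method upgrades a constant-factor streaming approximation to a [1 + O(σ²)]-approximation.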