Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
Vector quantization and signal compression
Accelerating exact k-means algorithms with geometric reasoning
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
ACM Computing Surveys (CSUR)
Local search heuristic for k-median and facility location problems
STOC '01 Proceedings of the thirty-third annual ACM symposium on Theory of computing
Approximate clustering via core-sets
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
A local search approximation algorithm for k-means clustering
Proceedings of the eighteenth annual symposium on Computational geometry
Acceleration of K-Means and Related Clustering Algorithms
ALENEX '02 Revised Papers from the 4th International Workshop on Algorithm Engineering and Experiments
Clustering Data Streams: Theory and Practice
IEEE Transactions on Knowledge and Data Engineering
Better streaming algorithms for clustering problems
Proceedings of the thirty-fifth annual ACM symposium on Theory of computing
FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Online facility location
FOCS '01 Proceedings of the 42nd IEEE symposium on Foundations of Computer Science
On coresets for k-means and k-median clustering
STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Coresets in dynamic geometric data streams
Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
How slow is the k-means method?
Proceedings of the twenty-second annual symposium on Computational geometry
The Effectiveness of Lloyd-Type Methods for the k-Means Problem
FOCS '06 Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science
A PTAS for k-means clustering based on weak coresets
SCG '07 Proceedings of the twenty-third annual symposium on Computational geometry
k-means++: the advantages of careful seeding
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Data analysis in the social sciences: what about the details?
AFIPS '65 (Fall, part I) Proceedings of the November 30--December 1, 1965, fall joint computer conference, part I
Approximate clustering without the approximation
SODA '09 Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms
IEEE Transactions on Information Theory
Memoryless facility location in one pass
ACM Transactions on Algorithms (TALG)
Measuring the impact of sense similarity on word sense induction
EMNLP '11 Proceedings of the First Workshop on Unsupervised Learning in NLP
Deterministic sublinear-time approximations for metric 1-median selection
Information Processing Letters
Scalable K-Means by ranked retrieval
Proceedings of the 7th ACM international conference on Web search and data mining
One of the central problems in data analysis is k-means clustering. In recent years, the streaming variant of this problem has received considerable attention in the literature, culminating in a series of results (Har-Peled and Mazumdar; Frahling and Sohler; Frahling, Monemizadeh, and Sohler; Chen) that give a (1 + ε)-approximation for k-means clustering in the streaming setting. Unfortunately, since optimizing the k-means objective is Max-SNP-hard, any algorithm that achieves a (1 + ε)-approximation must take time exponential in k unless P = NP. Thus, to avoid exponential dependence on k, additional assumptions are needed to guarantee both a high-quality approximation and polynomial running time. A recent paper of Ostrovsky, Rabani, Schulman, and Swamy (FOCS 2006) introduced the very natural assumption of data separability: this assumption closely reflects how k-means is used in practice and allowed the authors to obtain a high-quality approximation for k-means clustering in the non-streaming setting with polynomial running time even for large values of k. Their work left open a natural and important question: are similar results possible in the streaming setting? This is the question we answer in this paper, albeit using substantially different techniques. We show a near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space that uses sublinear memory and a single pass, under the same data separability assumption. Our algorithm offers significant improvements in both space and running time over previous work while yielding asymptotically best-possible performance (assuming that the running time must be fully polynomial and P ≠ NP). The novel techniques we develop along the way imply a number of additional results: we provide a high-probability performance guarantee for online facility location (whereas Meyerson's FOCS 2001 algorithm gave bounds only in expectation); we develop a constant-factor approximation method for the general class of semi-metric clustering problems; we improve the space requirements for streaming constant-factor approximation of k-median by a logarithmic factor (even without σ-separability); and finally, we design a "re-sampling method" in the streaming setting that converts any constant-factor approximation for clustering into a [1 + O(σ²)]-approximation for σ-separable data.
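To make the building blocks mentioned above concrete, below is a minimal, illustrative Python sketch of the k-means objective and of the classic Meyerson-style online facility location rule referenced in the abstract: each arriving point either opens a new facility at its own location, with probability proportional to its distance to the nearest open facility divided by a facility cost f, or is assigned to that nearest facility. This is a sketch of the textbook rule under assumed parameters (the facility cost f, the helper names, and the toy data are illustrative), not the paper's actual streaming algorithm, which adds the machinery needed for high-probability guarantees and for the [1 + O(σ²)] bound.

    import math
    import random

    def squared_dist(p, q):
        # Squared Euclidean distance between two points given as tuples of floats.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def kmeans_cost(points, centers):
        # The k-means objective: sum over points of squared distance to the nearest center.
        return sum(min(squared_dist(p, c) for c in centers) for p in points)

    def online_facility_location(stream, f):
        # One pass of the Meyerson-style online facility location rule (metric version).
        # Each arriving point opens a new facility with probability min(dist / f, 1),
        # where dist is its distance to the nearest open facility; otherwise it is
        # assigned to that facility and pays dist. k-means-style adaptations typically
        # use squared distances instead.
        facilities = []
        cost = 0.0
        for p in stream:
            if not facilities:
                facilities.append(p)        # the first point always opens a facility
                cost += f
                continue
            dist = min(math.sqrt(squared_dist(p, c)) for c in facilities)
            if random.random() < min(dist / f, 1.0):
                facilities.append(p)        # open a new facility at p, paying f
                cost += f
            else:
                cost += dist                # serve p from its nearest facility
        return facilities, cost

    if __name__ == "__main__":
        random.seed(0)
        # Toy stream: three well-separated Gaussian blobs, shuffled into one pass.
        stream = [(random.gauss(mx, 0.1), random.gauss(my, 0.1))
                  for mx, my in [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
                  for _ in range(100)]
        random.shuffle(stream)
        centers, cost = online_facility_location(stream, f=1.0)
        print(len(centers), "facilities opened; total cost", round(cost, 2))

Under the ORSS-style notion of separability used in the abstract, a dataset is (roughly) σ-separable when the optimal k-means cost is at most σ² times the optimal (k − 1)-means cost; it is under this assumption that the paper's re-sampling method upgrades a constant-factor streaming approximation to a [1 + O(σ²)]-approximation.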