StreamKM++: A clustering algorithm for data streams

Authors:
Marcel R. Ackermann;Marcus Märtens;Christoph Raupach;Kamil Swierkot;Christiane Lammersen;Christian Sohler
Affiliations:
University of Paderborn, Paderborn, Germany;University of Paderborn, Paderborn, Germany;University of Paderborn, Paderborn, Germany;University of Paderborn, Paderborn, Germany;Simon Fraser University, Burnaby, B.C., Canada;TU Dortmund, Dortmund, Germany
Venue:
Journal of Experimental Algorithmics (JEA)
Year:
2012

References (17):
BIRCH: an efficient data clustering method for very large databases. SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS) - Special issue on uniform random number generation
Clustering Data Streams: Theory and Practice. IEEE Transactions on Knowledge and Data Engineering
Clustering data streams. FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Streaming-Data Algorithms for High-Quality Clustering. ICDE '02 Proceedings of the 18th International Conference on Data Engineering
On coresets for k-means and k-median clustering. STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Approximating extent measures of points. Journal of the ACM (JACM)
A local search approximation algorithm for k-means clustering. Computational Geometry: Theory and Applications - Special issue on the 18th annual symposium on computational geometry, SoCG 2002
Coresets in dynamic geometric data streams. Proceedings of the thirty-seventh annual ACM symposium on Theory of computing
A PTAS for k-means clustering based on weak coresets. SCG '07 Proceedings of the twenty-third annual symposium on Computational geometry
k-means++: the advantages of careful seeding. SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
NP-hardness of Euclidean sum-of-squares clustering. Machine Learning
k-means requires exponentially many iterations even in the plane. Proceedings of the twenty-fifth annual symposium on Computational geometry
Adaptive Sampling for k-Means Clustering. APPROX '09 / RANDOM '09 Proceedings of the 12th International Workshop and 13th International Workshop on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques
On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications. SIAM Journal on Computing
k-Means Has Polynomial Smoothed Complexity. FOCS '09 Proceedings of the 2009 50th Annual IEEE Symposium on Foundations of Computer Science
Clustering with Spectral Norm and the k-Means Algorithm. FOCS '10 Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science

Cited by (4):
Warped K-Means: An algorithm to cluster sequentially-distributed data. Information Sciences: an International Journal
Learning Big (Image) Data via Coresets for Dictionaries. Journal of Mathematical Imaging and Vision
Data stream clustering: A survey. ACM Computing Surveys (CSUR)
Survey of Clustering: Algorithms and Applications. International Journal of Information Retrieval Research

Abstract

We develop a new k-means clustering algorithm for data streams of points from a Euclidean space. We call this algorithm StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07). To compute the small sample, we propose two new techniques. First, we use an adaptive, nonuniform sampling approach similar to the k-means++ seeding procedure to obtain small coresets from the data stream. This construction is rather easy to implement and, unlike other coreset constructions, its running time has only a small dependency on the dimensionality of the data. Second, we propose a new data structure, which we call a coreset tree. The use of these coreset trees significantly reduces the time needed for the adaptive, nonuniform sampling during our coreset construction. We compare our algorithm experimentally with two well-known streaming implementations: BIRCH [Zhang et al. 1997] and StreamLS [Guha et al. 2003]. In terms of quality (sum of squared errors), our algorithm is comparable with StreamLS and significantly better than BIRCH (up to a factor of 2). In addition, BIRCH requires significant effort to tune its parameters. In terms of running time, our algorithm is slower than BIRCH. Compared with StreamLS, our algorithm scales much better as the number of centers increases. We conclude that, if the first priority is the quality of the clustering, then our algorithm provides a good alternative to BIRCH and StreamLS, in particular if the number of cluster centers is large. We also give a theoretical justification of our approach by proving that our sample set is a small coreset in low-dimensional spaces.
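
The adaptive, nonuniform sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `sq_dist` and `d2_weighted_sample` are hypothetical. The weighting rule (each sampled point is weighted by the number of input points closest to it) is an assumption that the abstract does not spell out. The sketch also omits the coreset tree, so every draw scans all points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_weighted_sample(points, m, rng=None):
    """Draw up to m points by k-means++-style D^2 sampling.

    Returns a list of (point, weight) pairs. Assumption: the weight of a
    sampled point is the number of input points nearest to it.
    """
    rng = rng or random.Random()
    n = len(points)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(points)")

    # The first sample point is chosen uniformly at random.
    first = rng.randrange(n)
    chosen = [first]
    # dist[i]: squared distance from points[i] to its nearest sample point.
    # owner[i]: position in `chosen` of that nearest sample point.
    dist = [sq_dist(p, points[first]) for p in points]
    owner = [0] * n

    while len(chosen) < m:
        if sum(dist) == 0.0:
            break  # every point coincides with a sample point already
        # Pick the next point with probability proportional to dist[i].
        idx = rng.choices(range(n), weights=dist, k=1)[0]
        chosen.append(idx)
        slot = len(chosen) - 1
        # Reassign points that are now closer to the new sample point.
        for i, p in enumerate(points):
            d = sq_dist(p, points[idx])
            if d < dist[i]:
                dist[i] = d
                owner[i] = slot

    weights = [0] * len(chosen)
    for o in owner:
        weights[o] += 1
    return [(points[j], w) for j, w in zip(chosen, weights)]


# Usage: compress 200 two-dimensional points into 10 weighted points.
data_rng = random.Random(1)
data = [(data_rng.random(), data_rng.random()) for _ in range(200)]
sample = d2_weighted_sample(data, 10, rng=random.Random(0))
```

The weighted sample would then be passed to a weighted k-means++ solver in place of the full data. Without a coreset tree each draw costs time linear in the number of points, which is the cost the abstract says the tree is meant to reduce.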