Approximate Clustering on Distributed Data Streams

Authors:
Qi Zhang;Jinze Liu;Wei Wang
Affiliations:
Department of Computer Science, University of North Carolina, Chapel Hill, Chapel Hill, NC 27599-3175, USA. zhangq@cs.unc.edu;Department of Computer Science, University of Kentucky, Lexington, KY 40506-0046, USA. liuj@netlab.uky.edu;Department of Computer Science, University of North Carolina, Chapel Hill, Chapel Hill, NC 27599-3175, USA. weiwang@cs.unc.edu
Venue:
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Year:
2008

Citing 0
Cited 4

Toward visual analysis of ensemble data sets
Proceedings of the 2009 Workshop on Ultrascale Visualization

Continuously identifying representatives out of massive streams
ADMA'11 Proceedings of the 7th international conference on Advanced Data Mining and Applications - Volume Part I

Prediction-based geometric monitoring over distributed data streams
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

A Sequential Sampling Framework for Spectral k-Means Based on Efficient Bootstrap Accuracy Estimations: Application to Distributed Clustering
ACM Transactions on Knowledge Discovery from Data (TKDD)

Abstract

We investigate the problem of clustering on distributed data streams. In particular, we consider k-median clustering of stream data arriving at distributed sites that communicate through a routing tree. Distributed clustering on high-speed data streams is challenging because each site has limited communication capacity, storage space, and computing power. In this paper, we propose a suite of algorithms for computing (1 + ε)-approximate k-median clustering over distributed data streams under three different topology settings: topology-oblivious, height-aware, and path-aware. Our algorithms reduce the maximum per-node transmission to polylog N (as opposed to Ω(N) for transmitting the raw data). We have simulated our algorithms on a distributed stream system with both real and synthetic datasets composed of millions of data points. In practice, our algorithms reduce the data transmission to a small fraction of the original data. Moreover, our results indicate that the algorithms are scalable with respect to the data volume, approximation factor, and number of sites.
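
To make the (1 + ε) guarantee concrete, the following is a minimal sketch of the k-median objective and of what it means for a set of centers to be (1 + ε)-approximate. It is not the paper's algorithm: it assumes one-dimensional points with absolute distance, restricts candidate centers to the input points, and finds the optimum by brute force, so it is only suitable for tiny inputs. All function names are hypothetical and chosen for illustration.

```python
from itertools import combinations


def kmedian_cost(points, centers):
    """k-median cost: sum over points of the distance to the nearest center."""
    return sum(min(abs(p - c) for c in centers) for p in points)


def optimal_discrete_kmedian(points, k):
    """Brute-force optimum with centers restricted to the input points.

    Assumption for illustration only: exhaustive search over all k-subsets,
    which is feasible just for very small inputs.
    """
    candidates = sorted(set(points))
    best = min(combinations(candidates, k),
               key=lambda cs: kmedian_cost(points, cs))
    return list(best), kmedian_cost(points, best)


def is_approximation(points, centers, k, eps):
    """True if `centers` costs at most (1 + eps) times the optimal cost."""
    _, opt_cost = optimal_discrete_kmedian(points, k)
    return kmedian_cost(points, centers) <= (1 + eps) * opt_cost


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
opt_centers, opt_cost = optimal_discrete_kmedian(points, 2)
print(opt_centers, opt_cost)                             # [2.0, 11.0] 4.0
print(is_approximation(points, [2.0, 10.0], 2, 0.25))    # True: cost 5.0 <= 1.25 * 4.0
print(is_approximation(points, [2.0, 10.0], 2, 0.10))    # False: cost 5.0 > 1.10 * 4.0
```

In this toy example the centers [2.0, 10.0] cost 5.0 against an optimum of 4.0, so they qualify as a (1 + ε)-approximation for ε = 0.25 but not for ε = 0.10.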