Approximate Clustering on Distributed Data Streams

  • Authors:
  • Qi Zhang; Jinze Liu; Wei Wang

  • Affiliations:
  • Department of Computer Science, University of North Carolina, Chapel Hill, Chapel Hill, NC 27599-3175, USA. zhangq@cs.unc.edu; Department of Computer Science, University of Kentucky, Lexington, KY 40506-0046, USA. liuj@netlab.uky.edu; Department of Computer Science, University of North Carolina, Chapel Hill, Chapel Hill, NC 27599-3175, USA. weiwang@cs.unc.edu

  • Venue:
  • ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
  • Year:
  • 2008

Abstract

We investigate the problem of clustering on distributed data streams. In particular, we consider k-median clustering of stream data arriving at distributed sites that communicate through a routing tree. Distributed clustering on high-speed data streams is a challenging task due to the limited communication capacity, storage space, and computing power at each site. In this paper, we propose a suite of algorithms for computing (1 + ε)-approximate k-median clustering over distributed data streams under three different topology settings: topology-oblivious, height-aware, and path-aware. Our algorithms reduce the maximum per-node transmission to polylog N (as opposed to Ω(N) for transmitting the raw data). We have simulated our algorithms on a distributed stream system with both real and synthetic datasets composed of millions of data points. In practice, our algorithms are able to reduce the data transmission to a small fraction of the original data. Moreover, our results indicate that the algorithms are scalable with respect to the data volume, approximation factor, and the number of sites.
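
As a rough illustration of the setting described in the abstract, the sketch below shows the k-median objective on weighted points and the general idea of forwarding a small weighted summary up a routing tree instead of raw data. It is a toy example and not the paper's algorithm: the function names, the hand-picked representatives, and the two-leaf tree are all assumptions made for illustration, and no (1 + ε) guarantee is claimed.

```python
import math


def kmedian_cost(points, weights, centers):
    """Weighted k-median cost: sum of w * distance to the nearest center."""
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def summarize(points, weights, representatives):
    """Collapse weighted points onto their nearest representative.

    Returns one (representative, total_weight) pair per representative that
    attracted weight, so a node forwards at most len(representatives) items.
    """
    totals = {}
    for p, w in zip(points, weights):
        nearest = min(representatives, key=lambda r: math.dist(p, r))
        totals[nearest] = totals.get(nearest, 0) + w
    return list(totals.items())


# Hypothetical two-leaf routing tree: each leaf summarizes, the root merges.
leaf_a = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
leaf_b = [(10.0, 11.0), (0.5, 0.5)]
reps = [(0.0, 0.0), (10.0, 10.0)]  # assumed representatives, chosen by hand

sent_a = summarize(leaf_a, [1] * len(leaf_a), reps)
sent_b = summarize(leaf_b, [1] * len(leaf_b), reps)

# The root sees only the weighted summaries, never the raw points.
merged = sent_a + sent_b
root_summary = summarize([p for p, _ in merged], [w for _, w in merged], reps)
```

Here each leaf transmits at most two weighted items regardless of how many points it has seen, which is the kind of saving the abstract contrasts with sending all N raw points. How representatives are chosen and how error accumulates along the tree are exactly the parts this sketch leaves out.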