Parallelizing Clustering of Geoscientific Data Sets using Data Streams

  • Authors:
  • Silvia Nittel; Kelvin T. Leung

  • Affiliations:
  • University of Maine, Orono; University of California, Los Angeles

  • Venue:
  • SSDBM '04 Proceedings of the 16th International Conference on Scientific and Statistical Database Management
  • Year:
  • 2004

Abstract

Computing data mining algorithms such as clustering on massive geospatial data sets is still neither feasible nor efficient today. In this paper, we introduce a k-means algorithm that is based on the data stream paradigm. The so-called partial/merge k-means algorithm is implemented as a set of data stream operators which are adaptable to available computing resources such as volatile memory and processing power. The partial data stream operator consumes as much data as fits into RAM and performs a weighted k-means on the data subset. Subsequently, the weighted partial results are merged by a second data stream operator. All operators can be cloned and parallelized. In our analytical and experimental performance evaluation, we demonstrate that the partial/merge k-means can outperform a one-step algorithm by a large margin with regard to overall computation time and clustering quality with increasing data density per grid cell.
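
The partial and merge steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`weighted_kmeans`, `chunks`, `partial_merge_kmeans`) and the `chunk_size` parameter, which stands in for "as much data as fits into RAM", are hypothetical, the clustering is plain Lloyd-style k-means with random initialization, and the chunks are processed serially rather than by cloned, parallel stream operators.

```python
import random

def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Lloyd-style k-means on weighted points.

    Returns (centroids, cluster_weights). The weights reflect the
    assignment made in the final iteration.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    centroids = [list(p) for p in rng.sample(points, k)]
    totals = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest centroid (squared Euclidean).
            j = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        for j in range(k):
            if totals[j] > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = [s / totals[j] for s in sums[j]]
    return centroids, totals

def chunks(stream, chunk_size):
    """Group a stream of points into lists of at most chunk_size points."""
    buf = []
    for p in stream:
        buf.append(p)
        if len(buf) == chunk_size:
            yield buf
            buf = []
    if buf:
        yield buf

def partial_merge_kmeans(stream, k, chunk_size):
    """Assumed two-stage scheme: cluster each chunk, then merge the results."""
    part_centroids, part_weights = [], []
    # Partial step: each chunk is clustered on its own with unit weights.
    for chunk in chunks(stream, chunk_size):
        cents, wts = weighted_kmeans(chunk, [1.0] * len(chunk),
                                     min(k, len(chunk)))
        for c, w in zip(cents, wts):
            if w > 0:  # drop empty clusters before merging
                part_centroids.append(c)
                part_weights.append(w)
    # Merge step: weighted k-means over the partial centroids, each
    # weighted by the number of points it represents.
    return weighted_kmeans(part_centroids, part_weights,
                           min(k, len(part_centroids)))
```

In this sketch, `partial_merge_kmeans(points, k=2, chunk_size=50)` never holds more than 50 raw points at a time, and the merge step sees only the per-chunk centroids and their weights. Because the partial step treats each chunk independently, it is the part that could be distributed across workers.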