Parallelizing Clustering of Geoscientific Data Sets using Data Streams

Authors:
Silvia Nittel;Kelvin T. Leung
Affiliations:
University of Maine, Orono;University of California, Los Angeles
Venue:
SSDBM '04 Proceedings of the 16th International Conference on Scientific and Statistical Database Management
Year:
2004

Citing 0
Cited 2

Patch clustering for massive data sets
Neurocomputing

Efficient Learning from Massive Spatial-Temporal Data through Selective Support Vector Propagation
Proceedings of the 2006 conference on ECAI 2006: 17th European Conference on Artificial Intelligence, August 29 -- September 1, 2006, Riva del Garda, Italy

Abstract

Computing data mining algorithms such as clustering on massive geospatial data sets is still not feasible nor efficient today. In this paper, we introduce a k-means algorithm that is based on the data stream paradigm. The so-called partial/merge k-means algorithm is implemented as a set of data stream operators which are adaptable to available computing resources such as volatile memory and processing power. The partial data stream operator consumes as much data as can be fit into RAM, and performs a weighted k-means on the data subset. Subsequently, the weighted partial results are merged by a second data stream operator. All operators can be cloned, and parallelized. In our analytical and experimental performance evaluation, we demonstrate that the partial/merge k-means can outperform a one-step algorithm by a large margin with regard to overall computation time and clustering quality with increasing data density per grid cell.
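
The two-operator idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (weighted_kmeans, partial_kmeans, merge_kmeans), the random initialization, the fixed iteration count, and the use of a fixed chunk_size as a stand-in for "as much data as fits into RAM" are all assumptions made for illustration. The partial operator clusters each chunk and emits centroids weighted by the number of points they represent; the merge operator runs a weighted k-means over those weighted centroids.

```python
import random

def weighted_kmeans(points, weights, k, iters=20, seed=0):
    """Lloyd-style k-means on weighted points (illustrative, not the paper's code).

    Returns (centroids, mass), where mass[j] is the total weight assigned
    to centroid j in the last iteration.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumption: initialize from k randomly sampled input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    mass = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest centroid (squared Euclidean distance).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            mass[j] += w
            for d in range(dim):
                sums[j][d] += w * p[d]
        # Move each non-empty centroid to the weighted mean of its points.
        for j in range(k):
            if mass[j] > 0:
                centroids[j] = [s / mass[j] for s in sums[j]]
    return centroids, mass

def partial_kmeans(stream, chunk_size, k):
    """Partial operator: cluster one chunk at a time, yield weighted centroids."""
    chunk = []
    for point in stream:
        chunk.append(point)
        if len(chunk) == chunk_size:
            yield weighted_kmeans(chunk, [1.0] * len(chunk), k)
            chunk = []
    if chunk:  # final, possibly smaller chunk
        yield weighted_kmeans(chunk, [1.0] * len(chunk), min(k, len(chunk)))

def merge_kmeans(partials, k):
    """Merge operator: weighted k-means over the partial centroids."""
    points, weights = [], []
    for centroids, mass in partials:
        for c, m in zip(centroids, mass):
            if m > 0:  # skip centroids that ended up with no points
                points.append(c)
                weights.append(m)
    return weighted_kmeans(points, weights, min(k, len(points)))
```

Because partial_kmeans is a generator, merge_kmeans(partial_kmeans(stream, chunk_size, k), k) holds only one chunk of raw data in memory at a time. In this sketch the chunks are processed sequentially; running several partial operators in parallel, as the abstract suggests, would amount to feeding their outputs into the same merge step.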