Fast and exact out-of-core and distributed k-means clustering

  • Authors:
  • Ruoming Jin; Anjan Goswami; Gagan Agrawal

  • Affiliations:
  • Department of Computer Science, Kent State University, USA; Department of Computer Science and Engineering, Ohio State University, USA; Department of Computer Science and Engineering, Ohio State University, USA

  • Venue:
  • Knowledge and Information Systems
  • Year:
  • 2006

Abstract

Clustering has been one of the most widely studied topics in data mining, and k-means has been one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on approximate versions of k-means, which require only one or a small number of passes over the entire dataset.

In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes over the entire dataset and provably produces the same cluster centres as the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide a theoretical analysis showing that the cluster centres thus reported are the same as those computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedups between a factor of 2 and 4.5 compared with k-means.

This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than the two other options for exact clustering on distributed data: downloading all data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
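
The two-phase structure described in the abstract (k-means on an in-memory sample to obtain candidate centres, followed by a corrective pass over the disk-resident data) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the paper's FEKM algorithm: FEKM's guarantee of reproducing the exact k-means centres relies on additional bookkeeping during the full-data pass that is not reproduced here, and the function names, sampling fraction, and chunked-input interface below are assumptions made for the example.

```python
# Minimal sketch of a sample-then-correct scheme: run k-means on a sample
# to get candidate centres, then take a single pass over the full
# (chunked, "disk-resident") dataset to refine them.  This is NOT FEKM
# itself; FEKM's exactness guarantee requires extra bookkeeping omitted here.
import numpy as np


def kmeans(data, k, iters=100, seed=0):
    """Plain Lloyd's k-means, used here only on the in-memory sample."""
    rng = np.random.default_rng(seed)
    centres = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centre.
        labels = np.argmin(
            ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2), axis=1
        )
        new_centres = np.array(
            [data[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
             for j in range(k)]
        )
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres


def sample_then_one_pass(chunks, k, sample_fraction=0.05, seed=0):
    """Cluster a dataset that is read chunk by chunk.

    1. Draw a uniform sample from the chunks and run k-means on it.
    2. Take one pass over all chunks, accumulating per-cluster sums and
       counts, and recompute the centres from those statistics.
    """
    rng = np.random.default_rng(seed)
    chunks = list(chunks)

    # Phase 1: in-memory k-means on a small sample.
    sample = np.vstack([
        c[rng.random(len(c)) < sample_fraction] for c in chunks
    ])
    centres = kmeans(sample, k, seed=seed)

    # Phase 2: one corrective pass over the full dataset.
    sums = np.zeros_like(centres)
    counts = np.zeros(k, dtype=np.int64)
    for chunk in chunks:
        labels = np.argmin(
            ((chunk[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2), axis=1
        )
        for j in range(k):
            members = chunk[labels == j]
            sums[j] += members.sum(axis=0)
            counts[j] += len(members)
    return sums / np.maximum(counts, 1)[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Three synthetic Gaussian blobs, split into "disk-resident" chunks.
    blobs = np.vstack([rng.normal(loc=m, scale=0.3, size=(2000, 2))
                       for m in ([0, 0], [5, 5], [0, 5])])
    rng.shuffle(blobs)
    print(sample_then_one_pass(np.array_split(blobs, 10), k=3))
```

The sketch touches the full data only once after the sampling phase, which is where the speedup over multi-pass k-means would come from; the paper's contribution is making such a scheme produce exactly the k-means centres rather than an approximation.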