Clustering has been one of the most widely studied topics in data mining, and k-means is one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, much work has been done on approximate versions of k-means that require only one or a small number of passes over the entire dataset.

In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes over the entire dataset and provably produces the same cluster centres as the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these centres. We provide a theoretical analysis showing that the centres thus reported are the same as those computed by the original k-means algorithm. Experimental results on a number of real and synthetic datasets show speedups between a factor of 2 and 4.5 compared with k-means.

This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data distributed across loosely coupled machines. Unlike previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than the two other options for exact clustering on distributed data: downloading all the data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is significant load imbalance.
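The two-phase structure the abstract describes — run k-means to convergence on a sample, then take a full-data pass to adjust the sampled centres — can be sketched as follows. This is a minimal illustration of the sample-then-refine idea only, not the actual FEKM algorithm: FEKM additionally tracks boundary information around each centre to certify that its output matches exact k-means, which is omitted here. All function names (`kmeans`, `sample_then_refine`) and parameters (`sample_frac`, `iters`) are hypothetical.

```python
import random

def euclidean_sq(a, b):
    # Squared Euclidean distance between two points (tuples of floats).
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centres, iters=50):
    # Standard Lloyd iterations: assign each point to its nearest
    # centre, then recompute each centre as the mean of its cluster.
    for _ in range(iters):
        clusters = [[] for _ in centres]
        for p in points:
            idx = min(range(len(centres)),
                      key=lambda i: euclidean_sq(p, centres[i]))
            clusters[idx].append(p)
        new_centres = []
        for c, members in zip(centres, clusters):
            if members:
                dim = len(members[0])
                new_centres.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                new_centres.append(c)  # keep an empty cluster's centre
        if new_centres == centres:
            break
        centres = new_centres
    return centres

def sample_then_refine(points, k, sample_frac=0.1, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(points, max(k, int(sample_frac * len(points))))
    # Phase 1: run k-means to convergence on the sample only.
    centres = kmeans(sample, rng.sample(sample, k))
    # Phase 2: a single pass over the full dataset to adjust the
    # sampled centres (FEKM would also verify exactness here).
    return kmeans(points, centres, iters=1)
```

The point of the sketch is the cost profile: the expensive convergence loop touches only the sample, and the full disk-resident dataset is scanned once in phase 2, which is where FEKM's savings over multi-pass k-means come from.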