Fast and Exact Out-of-Core K-Means Clustering

Authors:
Anjan Goswami;Ruoming Jin;Gagan Agrawal
Affiliations:
Ohio State University;Ohio State University;Ohio State University
Venue:
ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
Year:
2004

Citing 0
Cited 7

Memory efficient subspace clustering for online data streams

IDEAS '08 Proceedings of the 2008 international symposium on Database engineering & applications
Efficiently tracing clusters over high-dimensional on-line data streams

Data & Knowledge Engineering
Using text mining and sentiment analysis for online forums hotspot detection and forecast

Decision Support Systems
Intelligent sequential mining via alignment: optimization techniques for very large DB

PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
Data compression by volume prototypes for streaming data

Pattern Recognition
Comprehensible and accurate cluster labels in text clustering

Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
Stratified k-means clustering over a deep web data source

Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset. In this paper, we present a new algorithm which typically requires only one or a small numberof passes on the entire dataset, and provably produces the same cluster centers as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centers, and then takes one or more passes over the entire dataset to adjust these cluster centers. We provide theoretical analysis to show that the cluster centers thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared to k-means.