Acceleration of K-Means and Related Clustering Algorithms

Authors:
Steven J. Phillips
Affiliations:
-
Venue:
ALENEX '02 Revised Papers from the 4th International Workshop on Algorithm Engineering and Experiments
Year:
2002

Citing 4
Cited 14

Accelerating exact k-means algorithms with geometric reasoning

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Scalability for clustering algorithms revisited

ACM SIGKDD Explorations Newsletter
Introductory Digital Image Processing: A Remote Sensing Perspective

Introductory Digital Image Processing: A Remote Sensing Perspective
The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data

UAI '00 Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence

A local search approximation algorithm for k-means clustering

Proceedings of the eighteenth annual symposium on Computational geometry
A local search approximation algorithm for k-means clustering

Computational Geometry: Theory and Applications - Special issue on the 18th annual symposium on computational geometry—SoCG2002
Centroidal Voronoi Tessellation Algorithms for Image Compression, Segmentation, and Multichannel Restoration

Journal of Mathematical Imaging and Vision
A fast k-means implementation using coresets

Proceedings of the twenty-second annual symposium on Computational geometry
A scalable algorithm for high-quality clustering of web snippets

Proceedings of the 2006 ACM symposium on Applied computing
VISTO: visual storyboard for web video browsing

Proceedings of the 6th ACM international conference on Image and video retrieval
An edge-weighted centroidal Voronoi tessellation model for image segmentation

IEEE Transactions on Image Processing
STIMO: STIll and MOving video storyboard for the web scenario

Multimedia Tools and Applications
Effective initialization of k-means for color quantization

ICIP'09 Proceedings of the 16th IEEE international conference on Image processing
Improving the performance of k-means for color quantization

Image and Vision Computing
Using Multi-Modal Semantic Association Rules to fuse keywords and visual features automatically for Web image retrieval

Information Fusion
Streaming k-means on well-clusterable data

Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
A new clustering algorithm based on k-means using a line segment as prototype

CIARP'11 Proceedings of the 16th Iberoamerican Congress conference on Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications
The effectiveness of lloyd-type methods for the k-means problem

Journal of the ACM (JACM)

Quantified Score

Hi-index	0.00

Abstract

This paper describes two simple modifications of K-means and related clustering algorithms that improve the running time without changing the output. The two resulting algorithms are called Compare-means and Sort-means. The time for an iteration of K-means is reduced from O(ndk), where n is the number of data points, k the number of clusters and d the dimension, to O(ndγ + k²d + k² log k) for Sort-means. Here γ ≤ k is the average over all points p of the number of means that are no more than twice as far as p is from the mean p was assigned to in the previous iteration. Compare-means performs a similar number of distance calculations to Sort-means, and is faster when the number of means is very large. Both modifications are extremely simple, and could easily be added to existing clustering implementations.

We investigate the empirical performance of the algorithms on three datasets drawn from practical applications. As a primary test case, we use the Isodata variant of K-means on a sample of 2.3 million 6-dimensional points drawn from a Landsat-7 satellite image. For this dataset, γ quickly drops to less than log₂ k, and the running time decreases accordingly. For example, a run with k = 100 drops from an hour and a half to sixteen minutes for Compare-means and six and a half minutes for Sort-means. Further experiments show similar improvements on datasets derived from a forestry application and from the analysis of BGP updates in an IP network.
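The pruning idea behind the Sort-means bound can be illustrated with a minimal sketch. It is not the paper's own pseudocode, which the abstract does not give. It assumes Euclidean distance and relies on the triangle inequality: if a mean m' lies at least twice as far from p's previous mean m as p does, then m' cannot be closer to p than m. The function name `sort_means_assign` and its parameters are hypothetical, chosen for illustration only.

```python
import math

def sort_means_assign(points, means, prev_assign):
    """One assignment step of a Sort-means-style pass (illustrative sketch).

    points      : list of coordinate tuples
    means       : list of coordinate tuples (current cluster means)
    prev_assign : index of the mean each point had in the previous iteration
    Returns (new assignments, number of point-to-mean distances computed).
    """
    k = len(means)
    # O(k^2 d): all pairwise distances between means.
    mm = [[math.dist(means[i], means[j]) for j in range(k)] for i in range(k)]
    # O(k^2 log k): for each mean, the indices of all means sorted by distance.
    order = [sorted(range(k), key=lambda j, i=i: mm[i][j]) for i in range(k)]

    assign = []
    evals = 0
    for p, c in zip(points, prev_assign):
        d_prev = math.dist(p, means[c])  # distance to the previously assigned mean
        evals += 1
        best, d_best = c, d_prev
        for j in order[c]:
            if j == c:
                continue
            # Triangle inequality: dist(p, m_j) >= mm[c][j] - d_prev >= d_prev,
            # so m_j and every later mean in the sorted order can be skipped.
            if mm[c][j] >= 2 * d_prev:
                break
            d = math.dist(p, means[j])
            evals += 1
            if d < d_best:
                best, d_best = j, d
        assign.append(best)
    return assign, evals
```

In this sketch the inner loop examines only means that are less than twice as far from the previous mean as p is, which corresponds to the quantity γ in the abstract. Apart from tie-breaking between equidistant means, the resulting assignment matches an exhaustive nearest-mean search.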