Acceleration of K-Means and Related Clustering Algorithms

  • Authors: Steven J. Phillips

  • Venue: ALENEX '02 Revised Papers from the 4th International Workshop on Algorithm Engineering and Experiments
  • Year: 2002

Abstract

This paper describes two simple modifications of K-means and related clustering algorithms that improve the running time without changing the output. The two resulting algorithms are called Compare-means and Sort-means. The time for an iteration of K-means is reduced from O(ndk), where n is the number of data points, k the number of clusters and d the dimension, to O(ndγ + k²d + k² log k) for Sort-means. Here γ ≤ k is the average over all points p of the number of means that are no more than twice as far as p is from the mean p was assigned to in the previous iteration. Compare-means performs a similar number of distance calculations to Sort-means, and is faster when the number of means is very large. Both modifications are extremely simple, and could easily be added to existing clustering implementations.

We investigate the empirical performance of the algorithms on three datasets drawn from practical applications. As a primary test case, we use the Isodata variant of K-means on a sample of 2.3 million 6-dimensional points drawn from a Landsat-7 satellite image. For this dataset, γ quickly drops to less than log₂ k, and the running time decreases accordingly. For example, a run with k = 100 drops from an hour and a half to sixteen minutes for Compare-means and six and a half minutes for Sort-means. Further experiments show similar improvements on datasets derived from a forestry application and from the analysis of BGP updates in an IP network.
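
The pruning idea behind the Sort-means bound can be illustrated with a short sketch. It is a reading of the abstract, not the paper's own code: the function names (`dist`, `sort_means_assign`, `brute_force_assign`) and the exact stopping rule are assumptions. The sketch relies on the triangle inequality: if c is the mean a point p was assigned to in the previous iteration and another mean m satisfies d(c, m) ≥ 2·d(p, c), then d(p, m) ≥ d(c, m) − d(p, c) ≥ d(p, c), so m cannot be strictly closer to p than c. Sorting the means by distance from c lets the inner loop stop at the first such m.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sort_means_assign(points, means, prev_assign):
    """Assignment step of one K-means iteration with sorted-means pruning.

    Hypothetical sketch: prev_assign[i] is the index of the mean that
    points[i] was assigned to in the previous iteration. Returns the new
    assignment and the number of point-to-mean distances computed.
    """
    k = len(means)
    # O(k^2 d): pairwise distances between all means.
    mean_dist = [[dist(means[i], means[j]) for j in range(k)] for i in range(k)]
    # O(k^2 log k): for each mean, all means in increasing order of distance.
    order = [sorted(range(k), key=lambda j, i=i: mean_dist[i][j]) for i in range(k)]

    assign = []
    n_dist = 0
    for p, c in zip(points, prev_assign):
        d_pc = dist(p, means[c])
        n_dist += 1
        best, best_d = c, d_pc
        for j in order[c]:
            if j == c:
                continue
            # Every remaining mean is at least this far from c, so by the
            # triangle inequality none of them can beat c: stop scanning.
            if mean_dist[c][j] >= 2 * d_pc:
                break
            d_pj = dist(p, means[j])
            n_dist += 1
            if d_pj < best_d:
                best, best_d = j, d_pj
        assign.append(best)
    return assign, n_dist


def brute_force_assign(points, means):
    """Standard O(ndk) assignment step, used here only as a reference."""
    return [min(range(len(means)), key=lambda j: dist(p, means[j])) for p in points]


if __name__ == "__main__":
    means = [(0, 0), (10, 0), (0, 10), (10, 10)]
    points = [(1, 1), (9, 2), (2, 8), (6, 7), (4, 1)]
    new_assign, n_dist = sort_means_assign(points, means, [0] * len(points))
    print(new_assign, n_dist, brute_force_assign(points, means))
```

In this sketch the saving depends on how good the previous assignment is: once most points stay with their previous mean and that mean is well separated from the others, the inner loop exits after very few distance computations, which matches the abstract's observation that the average number of means examined per point drops as the iterations proceed. When two means are exactly equidistant from a point, the sketch and the brute-force reference may break the tie differently.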