Short communication: Optimising k-means clustering results with standard software packages

Authors:
David J. Hand; Wojtek J. Krzanowski
Affiliations:
Department of Mathematics, Imperial College of Science, Technology and Medicine, Huxley Building, 180 Queen's Gate, London SW7 2BZ, UK; Department of Mathematical Sciences, University of Exeter, Laver Building, North Park Road, Exeter EX4 4QE, UK
Venue:
Computational Statistics & Data Analysis
Year:
2005

Abstract

The k-means method of clustering is a very popular technique, available in most standard statistical software packages. It is an iterative algorithm that requires a starting configuration to be specified, and many packages use a random start unless the user declares otherwise. Users are typically encouraged to run the analysis from a number of random starts and to take the best resulting solution. Some packages, however, base the default starting option on a preliminary analysis such as hierarchical clustering. This does not allow users to produce different "replicate" solutions, so the temptation is to treat the final solution as a global rather than a local optimum. This communication highlights the dangers of drawing that conclusion, suggests an iterative scheme that generally improves on the default solution, and compares this scheme with the "best of 20 random starts" method favoured by many users.
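
As a rough illustration of the "best of 20 random starts" baseline mentioned in the abstract, the sketch below runs plain Lloyd-style k-means from several random starting configurations and keeps the solution with the smallest within-cluster sum of squares. It is not the authors' iterative scheme, which the abstract does not specify. The function names (`lloyd`, `best_of_random_starts`), the toy data, and the choice of sampling starting centres from the data points are all assumptions made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points, centres, max_iter=100):
    """Run Lloyd-style k-means from the given starting centres.

    Returns (within-cluster sum of squares, labels, final centres).
    Hypothetical helper for illustration only.
    """
    centres = [tuple(c) for c in centres]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(len(centres)), key=lambda j: sq_dist(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so a local optimum is reached
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wss = sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
    return wss, labels, centres


def best_of_random_starts(points, k, n_starts=20, seed=0):
    """Run k-means from n_starts random starts and keep the best solution."""
    rng = random.Random(seed)
    # Assumption: a random start means k distinct data points used as centres.
    results = [lloyd(points, rng.sample(points, k)) for _ in range(n_starts)]
    best = min(results, key=lambda r: r[0])
    return best, [r[0] for r in results]


# Toy data (assumed for the example): three Gaussian blobs in the plane.
data_rng = random.Random(42)
points = [
    (cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
    for cx, cy in [(0, 0), (5, 5), (10, 0)]
    for _ in range(30)
]

best, all_wss = best_of_random_starts(points, k=3, n_starts=20, seed=1)
print("best WSS over 20 starts:", round(best[0], 3))
print("distinct WSS values found:", sorted({round(w, 3) for w in all_wss}))
```

If the list of distinct values contains more than one entry, different starts have converged to different local optima. That spread is exactly the information a user loses when a package offers only a single deterministic start.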