Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads

Authors:
Mark Ming-Tso Chiang; Boris Mirkin
Affiliations:
Birkbeck University of London, Department of Computer Science & Information Systems, London, UK; Birkbeck University of London, Department of Computer Science & Information Systems, London, UK and State University - Higher School of Economics, Moscow, Russia
Venue:
Journal of Classification
Year:
2010

Abstract

The issue of determining “the right number of clusters” in K-Means has attracted considerable interest, especially in recent years. Cluster intermix appears to be the factor that most affects the clustering results. This paper proposes an experimental setting for comparing different approaches on data generated from Gaussian clusters, with controlled parameters of between- and within-cluster spread to model cluster intermix. The setting allows centroid recovery to be evaluated on a par with the conventional evaluation of cluster recovery. The subjects of our interest are two versions of the “intelligent” K-Means method, iK-Means, which find the “right” number of clusters by extracting “anomalous patterns” from the data one by one. We compare them with seven other methods, including Hartigan’s rule, averaged Silhouette width and Gap statistic, under different between- and within-cluster spread-shape conditions. There are several consistent patterns in the results of our experiments; for example, Hartigan’s rule reproduces the right K best, but not the clusters or their centroids. This leads us to propose an adjusted version of iK-Means, which performs well in the current experimental setting.
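As a rough illustration of the kind of experiment the abstract describes, the sketch below generates spherical Gaussian clusters with separately controlled between- and within-cluster spread, and then applies Hartigan's rule to pick K. It is not the authors' generator or code: the function names, the uniform box for centroids, the spread values and the restart counts are all assumptions made for this example. Hartigan's index is taken in its usual form, H(K) = (W_K / W_{K+1} - 1)(N - K - 1), with the conventional rule-of-thumb threshold of 10, where W_K is the K-Means within-cluster sum of squares.

```python
import numpy as np


def generate_gaussian_clusters(k, n_per_cluster, dim,
                               between_spread, within_spread, rng):
    """Draw k spherical Gaussian clusters (illustrative generator, an assumption).

    Centroids are uniform in a box of half-width `between_spread`; points
    scatter around their centroid with standard deviation `within_spread`.
    """
    centroids = rng.uniform(-between_spread, between_spread, size=(k, dim))
    labels = np.repeat(np.arange(k), n_per_cluster)
    noise = within_spread * rng.standard_normal((k * n_per_cluster, dim))
    return centroids[labels] + noise, labels, centroids


def kmeans_inertia(points, k, rng, n_restarts=10, n_iter=100):
    """Smallest within-cluster sum of squares W_K over random restarts."""
    best = np.inf
    for _ in range(n_restarts):
        # Initialise centers at k distinct data points.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            # Keep the old center if a cluster happens to become empty.
            new_centers = np.array([
                points[assign == j].mean(axis=0) if np.any(assign == j)
                else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best


def hartigan_k(inertias, n_points, threshold=10.0):
    """Return the smallest K with H(K) <= threshold, or None if there is none.

    `inertias[i]` must hold W_K for K = i + 1.
    """
    for i in range(len(inertias) - 1):
        k = i + 1
        h = (inertias[i] / inertias[i + 1] - 1.0) * (n_points - k - 1)
        if h <= threshold:
            return k
    return None


# Example run; every parameter value here is an arbitrary choice.
rng = np.random.default_rng(0)
X, y, true_centroids = generate_gaussian_clusters(
    k=3, n_per_cluster=50, dim=15,
    between_spread=10.0, within_spread=1.0, rng=rng)
inertias = [kmeans_inertia(X, k, rng) for k in range(1, 7)]
chosen_k = hartigan_k(inertias, len(X))
print("K chosen by Hartigan's rule:", chosen_k)
```

Shrinking `between_spread` or raising `within_spread` increases cluster intermix, which is the condition under which the abstract reports that the methods start to differ. Comparing the recovered centers with `true_centroids` would correspond to the centroid-recovery evaluation mentioned there, which this sketch does not implement.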