A comparative study of efficient initialization methods for the k-means clustering algorithm

Authors:
M. Emre Celebi; Hassan A. Kingravi; Patricio A. Vela
Affiliations:
Department of Computer Science, Louisiana State University, Shreveport, LA, USA; School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA; School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Venue:
Expert Systems with Applications: An International Journal
Year:
2013

Abstract

K-means is undoubtedly the most widely used partitional clustering algorithm. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of the cluster centers. Numerous initialization methods have been proposed to address this problem. In this paper, we first present an overview of these methods with an emphasis on their computational efficiency. We then compare eight commonly used linear time complexity initialization methods on a large and diverse collection of data sets using various performance criteria. Finally, we analyze the experimental results using non-parametric statistical tests and provide recommendations for practitioners. We demonstrate that popular initialization methods often perform poorly and that there are in fact strong alternatives to these methods.
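The sensitivity to initial center placement described in the abstract can be illustrated with a minimal sketch. The code below is not one of the eight methods compared in the paper; it is a plain batch (Lloyd-style) k-means written for illustration, and the function names (`sq_dist`, `assign`, `lloyd_kmeans`), the toy data set, and the two sets of starting centers are all assumptions introduced here. It runs the same algorithm on the same four points from two different starting positions and reports the resulting sum of squared errors (SSE).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]


def lloyd_kmeans(
    points: Sequence[Point],
    init_centers: Sequence[Point],
    max_iter: int = 100,
) -> Tuple[List[Point], float]:
    """Batch k-means from the given initial centers; returns (centers, SSE)."""
    centers = [tuple(c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move the center to the mean of its assigned points.
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(old)
        if new_centers == centers:
            break  # fixed point reached
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, sse


# Hypothetical toy data: the corners of a 4-by-1 rectangle, k = 2.
data = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start with one center near each short side (left and right).
_, sse_good = lloyd_kmeans(data, [(0.0, 0.5), (4.0, 0.5)])

# Start with the centers on the long sides (bottom and top).
_, sse_bad = lloyd_kmeans(data, [(2.0, 0.0), (2.0, 1.0)])

print(sse_good, sse_bad)  # prints: 1.0 16.0
```

Both starting positions are already fixed points of the update, so the algorithm stops immediately in each case. The second run nonetheless ends with an SSE sixteen times larger than the first, which is the kind of poor local minimum that a careful initialization method is meant to avoid.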