In search of deterministic methods for initializing K-means and Gaussian mixture clustering

  • Authors:
  • Ting Su; Jennifer G. Dy

  • Affiliations:
  • Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115, USA (Corresponding author: Tel.: +1 617 373 3975; Fax: +1 617 373 8970; E-mail: tsu@ece.neu.edu)

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2007

Abstract

The performance of K-means and Gaussian mixture model (GMM) clustering depends on the initial guess of partitions. Typically, clustering algorithms are initialized by random starts. In our search for a deterministic method, we found two promising approaches: principal component analysis (PCA) partitioning and Var-Part (Variance Partitioning). K-means clustering tries to minimize the sum-squared-error criterion. The eigenvector with the largest eigenvalue is the direction that contributes most to the sum-squared-error; hence, a good candidate direction along which to project a cluster for splitting is the cluster's largest eigenvector, which is the basis for PCA partitioning. Similarly, GMM clustering maximizes the likelihood; minimizing the determinant of each cluster's covariance matrix helps to increase the likelihood, and the eigenvector with the largest eigenvalue contributes most to that determinant, making it a good candidate direction for splitting as well. However, PCA is computationally expensive. We thus introduce Var-Part, which is computationally less complex (with complexity equal to one K-means iteration) and approximates PCA partitioning under a diagonal covariance assumption. Experiments reveal that Var-Part performs similarly to PCA partitioning, and sometimes better, and leads K-means (and GMM) to sum-squared-error (and likelihood) values close to the best values obtained by several random-start runs, often with faster convergence.
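The sketch below illustrates the splitting idea described in the abstract: starting from the whole data set as one cluster, repeatedly split a cluster at its mean along either its principal eigenvector (PCA partitioning) or the coordinate axis with the largest variance (the diagonal-covariance approximation used by Var-Part), and use the resulting cluster means as deterministic initial centers. This is a minimal illustration, not the authors' implementation; the function names, the choice to split the cluster with the largest sum-squared-error, and the split-at-the-mean rule are assumptions made for the example.

```python
import numpy as np


def split_direction(X, use_pca=True):
    """Direction along which to split a cluster.

    use_pca=True : principal eigenvector of the cluster covariance (PCA partitioning).
    use_pca=False: coordinate axis with the largest variance (Var-Part-style
                   approximation under a diagonal covariance assumption).
    """
    if use_pca:
        cov = np.cov(X, rowvar=False)
        _, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
        return eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    d = np.zeros(X.shape[1])
    d[np.argmax(X.var(axis=0))] = 1.0          # axis of largest per-coordinate variance
    return d


def deterministic_init(X, k, use_pca=True):
    """Return k initial centers by recursive splitting (illustrative sketch).

    At each step, the cluster with the largest sum-squared-error (an assumed
    selection rule) is split by a hyperplane through its mean, orthogonal to
    the chosen direction.
    """
    clusters = [X]
    while len(clusters) < k:
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        c = clusters.pop(int(np.argmax(sse)))
        mu = c.mean(axis=0)
        d = split_direction(c, use_pca)
        proj = (c - mu) @ d                    # signed distance from the splitting hyperplane
        clusters += [c[proj <= 0], c[proj > 0]]
    return np.array([c.mean(axis=0) for c in clusters])


# Example usage: centers for K-means, using the cheaper Var-Part-style split.
# X = np.random.rand(500, 4)
# centers = deterministic_init(X, k=5, use_pca=False)
```

Because both variants depend only on the data, the resulting initialization is the same on every run, which is what makes the method deterministic; the Var-Part branch avoids the eigendecomposition and only needs per-coordinate variances.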