The performance of K-means and Gaussian mixture model (GMM) clustering depends on the initial guess of partitions. Typically, clustering algorithms are initialized with random starts. In our search for a deterministic method, we found two promising approaches: principal component analysis (PCA) partitioning and variance partitioning (Var-Part). K-means clustering minimizes the sum-squared-error criterion, and the eigenvector with the largest eigenvalue is the direction that contributes most to the sum-squared error. Hence, a good candidate direction along which to project a cluster for splitting is the cluster's principal eigenvector; this is the basis of PCA partitioning. Similarly, GMM clustering maximizes the likelihood, and minimizing the determinant of each cluster's covariance matrix helps to increase the likelihood. The principal eigenvector again contributes most to the determinant and is thus a good candidate direction for splitting. PCA, however, is computationally expensive. We therefore introduce Var-Part, which is computationally cheaper (its complexity equals that of one K-means iteration) and approximates PCA partitioning under a diagonal covariance assumption. Experiments reveal that Var-Part performs similarly to PCA partitioning, sometimes better, and leads K-means (and GMM) to sum-squared-error (and likelihood) values close to the optima obtained by several random-start runs, often with faster convergence.
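The divisive scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' exact algorithm: the choice of which cluster to split (largest sum-squared error) and the cut point (the mean along the chosen axis) are assumptions. Var-Part picks the coordinate axis of largest variance, i.e. the principal direction of a diagonal covariance approximation; PCA partitioning would instead project onto the cluster's principal eigenvector.

```python
import numpy as np

def var_part_init(X, k):
    """Var-Part sketch: repeatedly split the cluster with the largest
    sum-squared error along the coordinate axis of largest variance,
    cutting at the mean. Final cluster means become K-means seeds."""
    clusters = [X]
    while len(clusters) < k:
        # Pick the cluster contributing the most sum-squared error.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        c = clusters.pop(int(np.argmax(sse)))
        # Diagonal-covariance approximation: the axis of largest
        # variance stands in for the principal eigenvector, so no
        # eigendecomposition is needed (cost ~ one K-means pass).
        axis = int(np.argmax(c.var(axis=0)))
        cut = c[:, axis].mean()
        clusters.append(c[c[:, axis] <= cut])
        clusters.append(c[c[:, axis] > cut])
    # Deterministic initial centers for K-means (or GMM means).
    return np.array([c.mean(axis=0) for c in clusters])
```

The returned centers can be passed directly to a K-means implementation (e.g. scikit-learn's `KMeans(init=centers, n_init=1)`), replacing random restarts with a single deterministic run.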