Feature selection for k-means clustering stability: theoretical analysis and an algorithm

Authors:
Dimitrios Mavroeidis;Elena Marchiori
Affiliations:
IBM Research - Ireland, Dublin 15, Ireland;Department of Computer Science, Faculty of Sciences, Radboud University, Nijmegen, The Netherlands 6525 AJ
Venue:
Data Mining and Knowledge Discovery
Year:
2014

Abstract

Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust to noisy features and/or data sample fluctuations. The qualitative nature of the stability property hinders the development of practical, stability-optimizing data mining algorithms, as several issues naturally arise, such as how much stability is enough and how stability can be effectively associated with intrinsic data properties. In this work we address these issues and explore the effect of stability maximization in the continuous (PCA-based) k-means clustering problem. Our analysis is based on both mathematical optimization and statistical arguments that complement each other and allow for a solid interpretation of the algorithm's stability properties. Interestingly, we derive that stability maximization naturally introduces a tradeoff between cluster separation and variance, leading to the selection of features that have a high cluster separation index that is not artificially inflated by the features' variance. The proposed algorithmic setup is based on a Sparse PCA approach that selects the features that maximize stability in a greedy fashion. In our study, we also analyze several properties of Sparse PCA relevant to stability that promote Sparse PCA as a viable feature selection mechanism for clustering. The practical relevance of the proposed method is demonstrated in the context of cancer research, where we consider the problem of detecting potential tumor biomarkers using microarray gene expression data. The application of our method to a leukemia dataset shows that the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes. Some of them have relatively low variance and are not detected without the direct optimization of stability in Sparse PCA based k-means.
Apart from the qualitative evaluation, we have also verified our approach as a feature selection method for k-means clustering using four cancer research datasets. The quantitative empirical results illustrate the practical utility of our framework as a feature selection mechanism for clustering.
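
To make the idea of greedy, stability-driven feature selection for PCA-based k-means more concrete, the following is a minimal sketch and not the authors' algorithm. The abstract does not state the exact stability objective, so the sketch assumes one: the gap between the two largest eigenvalues of the covariance matrix of the selected features, which standard matrix perturbation theory links to the stability of the leading principal component. The function names `eigengap`, `greedy_select`, and `two_cluster_labels` and the synthetic data are hypothetical, introduced only for illustration.

```python
import numpy as np


def eigengap(X, features):
    """Assumed stability proxy: gap between the two largest covariance eigenvalues."""
    cov = np.atleast_2d(np.cov(X[:, features], rowvar=False))
    vals = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    second = vals[-2] if vals.size > 1 else 0.0
    return vals[-1] - second


def greedy_select(X, n_features):
    """Greedy forward selection: add the feature that most increases the eigengap."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        best = max(remaining, key=lambda j: eigengap(X, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


def two_cluster_labels(X, features):
    """Continuous (PCA-based) 2-means: split samples by the sign of the first PC score."""
    Z = X[:, features] - X[:, features].mean(axis=0)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ vt[0]  # projection on the leading principal direction
    return (scores > 0).astype(int)


# Synthetic data (an assumption for illustration): two balanced clusters,
# three informative features followed by three pure-noise features.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 20)
centers = np.where(truth[:, None] == 0, -2.0, 2.0)
informative = centers + 0.5 * rng.standard_normal((40, 3))
noise = rng.standard_normal((40, 3))
X = np.hstack([informative, noise])

selected = greedy_select(X, n_features=3)
labels = two_cluster_labels(X, selected)
print("selected features:", sorted(selected))
```

On this toy data the informative features also carry the largest variance, so the example only illustrates the greedy mechanics. It does not reproduce the separation-versus-variance tradeoff analyzed in the paper.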