Feature selection for k-means clustering stability: theoretical analysis and an algorithm
Data Mining and Knowledge Discovery
Stability of a learning algorithm with respect to small input perturbations is an important property, as it implies that the derived models are robust to noisy features and to fluctuations in the data sample. In this paper we explore the effect of stability optimization in the standard feature selection process for the continuous (PCA-based) k-means clustering problem. Interestingly, we derive that stability maximization naturally introduces a tradeoff between cluster separation and variance, leading to the selection of features whose high cluster separation index is not artificially inflated by their variance. The proposed algorithmic setup is based on a Sparse PCA approach that selects, in a greedy fashion, the features that maximize stability. In our study, we also analyze several stability-relevant properties of Sparse PCA that promote it as a viable feature selection mechanism for clustering. The practical relevance of the proposed method is demonstrated in the context of cancer research, where we consider the problem of detecting potential tumor biomarkers from microarray gene expression data. Applying our method to a leukemia dataset shows that the tradeoff between cluster separation and variance leads to the selection of features corresponding to important biomarker genes, some of which have relatively low variance and are not detected without the direct optimization of stability in Sparse PCA-based k-means.
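To fix ideas, greedy Sparse PCA feature selection of the kind the abstract refers to can be sketched as follows. This is a minimal illustration of the generic greedy heuristic (add, at each step, the feature that most increases the leading eigenvalue of the covariance submatrix); the paper's stability-optimizing selection criterion is not reproduced here, and the function name and interface are hypothetical.

```python
import numpy as np

def greedy_sparse_pca(X, k):
    """Greedily select k features by leading-eigenvalue gain.

    X: (n_samples, n_features) data matrix.
    Returns the list of selected feature indices, in selection order.
    NOTE: a generic greedy Sparse PCA sketch, not the paper's
    stability-based criterion.
    """
    n, d = X.shape
    C = np.cov(X, rowvar=False)  # d x d feature covariance matrix
    selected, remaining = [], list(range(d))
    for _ in range(k):
        best_feat, best_val = None, -np.inf
        for j in remaining:
            idx = selected + [j]
            sub = C[np.ix_(idx, idx)]
            # Leading eigenvalue of the covariance restricted to idx:
            # the variance explained by the first sparse component.
            val = np.linalg.eigvalsh(sub)[-1]
            if val > best_val:
                best_feat, best_val = j, val
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected
```

Note that, exactly as the abstract warns, this variance-driven criterion favors high-variance features regardless of cluster separation; the stability tradeoff derived in the paper is what counteracts that inflation.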