Stability Based Sparse LSI/PCA: Incorporating Feature Selection in LSI and PCA

Authors:
Dimitrios Mavroeidis;Michalis Vazirgiannis
Affiliations:
Department of Informatics, Athens University of Economics and Business, Greece;Department of Informatics, Athens University of Economics and Business, Greece and GEMO Team, INRIA/FUTURS, France
Venue:
ECML '07 Proceedings of the 18th European conference on Machine Learning
Year:
2007



Abstract

The stability of sample-based algorithms is a concept commonly used for parameter tuning and validity assessment. In this paper we focus on two well-studied algorithms, LSI and PCA, and propose a feature selection process that provably guarantees the stability of their outputs. Features are selected so that the level of (statistical) accuracy of the LSI/PCA input matrices is adequate for computing meaningful (stable) eigenvectors. The feature selection process "sparsifies" LSI/PCA: the instances are projected onto the eigenvectors of a principal submatrix of the original input matrix, which produces sparse factor loadings that are linear combinations solely of the selected features. We use bootstrap confidence intervals to assess the statistical accuracy of the input sample matrices, and matrix perturbation theory to relate that statistical accuracy to the stability of the eigenvectors. Experiments on several UCI datasets empirically verify our approach.
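The abstract names three ingredients: bootstrap estimates of how accurate the input matrix is, a perturbation-theoretic link from that accuracy to eigenvector stability, and projection onto the eigenvectors of a principal submatrix. The Python sketch below illustrates how these pieces could fit together. It is not the authors' algorithm: the function names, the use of the covariance matrix as the PCA input, and the Davis-Kahan-style ratio (bootstrap perturbation norm divided by the eigengap) are all assumptions made for illustration, since the abstract does not state the exact criterion.

```python
import numpy as np

def covariance(X):
    """Sample covariance of the rows of X (features in columns)."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / (X.shape[0] - 1)

def bootstrap_perturbation(X, n_boot=200, alpha=0.05, seed=0):
    """Upper (1 - alpha) bootstrap quantile of ||C_b - C||_2.

    Assumption: the spectral norm of the difference between a bootstrap
    covariance C_b and the sample covariance C stands in for the
    statistical accuracy of the input matrix.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    C = covariance(X)
    norms = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        norms[b] = np.linalg.norm(covariance(X[idx]) - C, ord=2)
    return np.quantile(norms, 1.0 - alpha)

def stability_ratio(X, features, k, **boot_kwargs):
    """Perturbation norm divided by the gap between eigenvalues k and k+1.

    Assumption: a small ratio suggests a stable top-k eigenspace, in the
    spirit of Davis-Kahan-type perturbation bounds. Requires
    k < len(features).
    """
    Xs = X[:, features]
    eigvals = np.linalg.eigvalsh(covariance(Xs))[::-1]  # descending order
    gap = eigvals[k - 1] - eigvals[k]
    return bootstrap_perturbation(Xs, **boot_kwargs) / gap

def sparse_loadings(X, features, k):
    """Top-k eigenvectors of the principal submatrix, zero-padded to d rows.

    Rows of unselected features stay exactly zero, so each factor is a
    linear combination of the selected features only.
    """
    _, vecs = np.linalg.eigh(covariance(X[:, features]))
    top = vecs[:, ::-1][:, :k]  # eigh returns ascending eigenvalues
    W = np.zeros((X.shape[1], k))
    W[features, :] = top
    return W
```

A hypothetical usage on a data matrix `X` with six features: `sparse_loadings(X, [0, 2, 4], k=2)` returns a 6-by-2 loading matrix whose rows 1, 3 and 5 are zero, and `stability_ratio(X, [0, 2, 4], k=1)` could be compared across candidate feature subsets, preferring subsets with a smaller ratio.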