Stability of feature selection algorithms: a study on high-dimensional spaces
Knowledge and Information Systems
Feature selection is an important step when building a classifier on high-dimensional data. When the number of observations is small, feature selection tends to be unstable: two feature subsets obtained from different datasets for the same classification problem often overlap only slightly. Although this instability is a crucial problem, little work has addressed selection stability. The behavior of feature selection is analyzed under various conditions, focusing on, but not limited to, t-score based feature selection approaches and small-sample data. The analysis proceeds in three steps: a theoretical study using a simple mathematical model, an empirical study on artificial data, and a study on real data. All three analyses lead to the same conclusions and provide a better understanding of the feature selection problem in high-dimensional data.
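The instability described above is easy to reproduce. The following sketch (illustrative only, not the paper's code; all sizes, effect strengths, and seeds are arbitrary assumptions) ranks features by absolute two-sample t-score on two small samples drawn from the same high-dimensional two-class distribution, then measures how much the two selected subsets overlap:

```python
import numpy as np

n_features, n_per_class, k = 1000, 15, 50  # high dimension, small sample

# Assumed setup: only the first 20 features carry a weak class difference.
shift = np.zeros(n_features)
shift[:20] = 0.5

def t_scores(X0, X1):
    """Two-sample t-score (Welch form) for each feature."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
    se = np.sqrt(v0 / len(X0) + v1 / len(X1))
    return (m1 - m0) / se

def top_k_features(seed):
    """Draw one small dataset and return the indices of the k best features."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(0.0, 1.0, (n_per_class, n_features))
    X1 = rng.normal(shift, 1.0, (n_per_class, n_features))
    scores = np.abs(t_scores(X0, X1))
    return set(np.argsort(scores)[-k:])

# Two independent datasets for the same classification problem.
subset_a, subset_b = top_k_features(1), top_k_features(2)
overlap = len(subset_a & subset_b) / k
print(f"overlap of the two top-{k} subsets: {overlap:.2f}")
```

With many noise features and few observations per class, the printed overlap is typically far below 1, mirroring the small-sample instability the abstract discusses.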