Feature over-selection

Authors:
Sarunas Raudys
Affiliations:
Vilnius Gediminas Technical University, Vilnius, Lithuania
Venue:
SSPR'06/SPR'06 Proceedings of the 2006 joint IAPR international conference on Structural, Syntactic, and Statistical Pattern Recognition
Year:
2006

Abstract

We propose a probabilistic framework for the analysis of inaccuracies due to feature selection (FS) when flawed estimates of the performance of feature subsets are utilized. The approach is based on an analysis of the random search FS procedure and on the postulation that the joint distribution of true and estimated classification errors is known a priori. We derive expected values for the FS bias, the difference between the actual classification error after FS and the classification error that would be obtained if ideal FS were performed according to exact estimates. The increase in true classification error due to inaccurate FS is comparable to, or even exceeds, the training bias, the difference between the generalization and Bayes errors. We show that there exists an overfitting phenomenon in feature selection, referred to in this paper as feature over-selection. The effects of feature over-selection could be reduced if FS were performed on the basis of positional statistics. Theoretical results are supported by experiments carried out on simulated Gaussian data, as well as on high-dimensional microarray gene expression data.
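
As a rough illustration of the setting described in the abstract, the Python sketch below simulates random search FS with noisy error estimates and measures the resulting FS bias by Monte Carlo. It is only a toy model under stated assumptions, not the paper's derivation. The Gaussian joint model (true error plus independent estimation noise), all parameter values, and the names `simulate_fs_bias`, `n_subsets` and `noise_sd` are hypothetical choices made for this example.

```python
import random


def simulate_fs_bias(n_subsets, noise_sd, n_trials=2000, seed=0):
    """Monte Carlo estimate of the FS bias under random search FS.

    Assumed toy model (not taken from the paper): each candidate feature
    subset has a true error drawn from N(0.2, 0.03^2), and its estimated
    error equals the true error plus independent N(0, noise_sd^2) noise.
    Returns (mean FS bias, mean optimism of the selected estimate).
    """
    rng = random.Random(seed)
    total_bias = 0.0
    total_optimism = 0.0
    for _ in range(n_trials):
        # True and estimated errors of the randomly generated subsets.
        true_err = [rng.gauss(0.2, 0.03) for _ in range(n_subsets)]
        est_err = [t + rng.gauss(0.0, noise_sd) for t in true_err]
        # Random search FS: keep the subset with the smallest estimated error.
        chosen = min(range(n_subsets), key=est_err.__getitem__)
        # FS bias: true error of the chosen subset minus the true error
        # of the subset an ideal (exact-estimate) selection would pick.
        total_bias += true_err[chosen] - min(true_err)
        # Optimism: how far the selected estimate falls below the true error.
        total_optimism += true_err[chosen] - est_err[chosen]
    return total_bias / n_trials, total_optimism / n_trials


if __name__ == "__main__":
    for m in (10, 100, 1000):
        bias, optimism = simulate_fs_bias(m, noise_sd=0.05, n_trials=500)
        print(f"subsets={m:5d}  FS bias={bias:.4f}  optimism={optimism:.4f}")
```

With `noise_sd=0` the estimates are exact and the FS bias vanishes. With noisy estimates the selected subset is, on average, worse than the ideal one, and its estimated error is optimistically low, which is the over-selection effect the abstract describes.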