Feature subset selection bias for classification learning

Authors:
Surendra K. Singhi; Huan Liu
Affiliations:
Arizona State University, Tempe, AZ; Arizona State University, Tempe, AZ
Venue:
ICML '06 Proceedings of the 23rd international conference on Machine learning
Year:
2006

Abstract

Feature selection is often applied to high-dimensional data prior to classification learning. Using the same training dataset for both selection and learning can result in so-called feature subset selection bias. This bias can putatively exacerbate over-fitting and degrade classification performance. In current practice, however, separate datasets are seldom employed for selection and learning, because dividing the training data into two datasets, one for feature selection and one for classifier learning, reduces the amount of data available for either task. This work attempts to address this dilemma. We formalize selection bias for classification learning, analyze its statistical properties, and study, through various experiments, the factors that affect selection bias and how the bias impacts classification learning. This research endeavors to illustrate and explain why the bias may not affect classification as negatively as is expected in regression.
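The effect of reusing one dataset for both selection and estimation can be illustrated with a small simulation. The sketch below is not the authors' formalization or experimental setup; it is a minimal illustration under assumed settings. All function names (`make_noise_data`, `class_mean_gap`, `select_top_k`, `mean_abs_gap`, `run_demo`), the scoring rule (absolute difference of class means), and the sample sizes are hypothetical choices made for this example. Every feature is pure noise, so any apparent class separation among the selected features on the selection data is an artifact of the selection itself.

```python
import random
import statistics


def make_noise_data(n_samples, n_features, rng):
    """Gaussian noise features with alternating 0/1 labels (no real signal)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def class_mean_gap(X, y, j):
    """Difference between the class-1 and class-0 means of feature j."""
    pos = [row[j] for row, label in zip(X, y) if label == 1]
    neg = [row[j] for row, label in zip(X, y) if label == 0]
    return statistics.fmean(pos) - statistics.fmean(neg)


def select_top_k(X, y, k):
    """Assumed selection rule: keep the k features with the largest |gap|."""
    scores = [abs(class_mean_gap(X, y, j)) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


def mean_abs_gap(X, y, features):
    """Average |gap| over the given feature indices."""
    return statistics.fmean([abs(class_mean_gap(X, y, j)) for j in features])


def run_demo(seed=0, n_samples=40, n_features=200, k=10):
    """Return (gap on the selection data, gap on fresh data) for the chosen features."""
    rng = random.Random(seed)
    X_sel, y_sel = make_noise_data(n_samples, n_features, rng)
    X_new, y_new = make_noise_data(n_samples, n_features, rng)
    chosen = select_top_k(X_sel, y_sel, k)
    return mean_abs_gap(X_sel, y_sel, chosen), mean_abs_gap(X_new, y_new, chosen)


if __name__ == "__main__":
    same, fresh = run_demo()
    print(f"mean |gap| on the data used for selection: {same:.3f}")
    print(f"mean |gap| on independent data:            {fresh:.3f}")
```

Under these assumptions the selected features should look noticeably more discriminative on the data that was used to pick them than on an independent sample. This is the kind of optimistic estimate that the abstract refers to as selection bias. How strongly such bias carries over into classifier accuracy is the question the paper studies, and this sketch does not answer it.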