The most prevalent techniques in Support Vector Machine (SVM) feature selection rest on the intuition that features whose weights are close to zero are not required for optimal classification. In this paper we show that, indeed, in the sample limit a linear SVM assigns zero weight to the irrelevant variables (in a theoretical, optimal sense), in both the soft-margin and the hard-margin case. However, SVM-based methods also have theoretical disadvantages. We present examples where the linear SVM may assign zero weight to strongly relevant variables (i.e., variables required for optimal estimation of the distribution of the target variable) and non-zero weight to weakly relevant features (i.e., features that are superfluous for optimal prediction given the remaining features). We contrast and theoretically compare this behavior with Markov blanket-based feature selection algorithms, which do not suffer these disadvantages on a broad class of distributions and can also be used for causal discovery.
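To make the weight-magnitude intuition concrete, below is a minimal sketch (not the paper's construction) of linear-SVM feature selection, assuming scikit-learn's LinearSVC; the feature layout, sample size, and the top-k cutoff are illustrative assumptions. It also demonstrates the weak-relevance caveat from the abstract: an exact duplicate of a predictive feature is superfluous given the original, yet the L2-regularized SVM spreads weight across both copies, so magnitude ranking keeps the redundant copy.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    n = 2000

    # Feature 0 is predictive of y; feature 1 is an exact copy of it
    # (weakly relevant: superfluous given feature 0); feature 2 is
    # pure noise (irrelevant).
    x0 = rng.normal(size=n)
    X = np.column_stack([x0, x0, rng.normal(size=n)])
    y = (x0 + 0.1 * rng.normal(size=n) > 0).astype(int)

    svm = LinearSVC(C=1.0, max_iter=20000).fit(X, y)
    w = svm.coef_.ravel()
    print(np.round(np.abs(w), 3))
    # Expected pattern: the irrelevant feature's weight shrinks toward
    # zero, while the two identical features each receive a clearly
    # non-zero share of the weight.

    # Weight-magnitude selection keeps the top-k features by |w|,
    # so the redundant duplicate survives the cut here.
    k = 2
    selected = np.argsort(-np.abs(w))[:k]
    print(selected)

The duplicate survives because any split of weight between the two identical columns produces the same decision values, and the L2 penalty is minimized by an equal split; ranking by |w| therefore retains a feature that is redundant given the other, which is exactly the kind of case the paper analyzes.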