Feature selection in high-dimensional data sets is an open problem with no universally satisfactory method available. In this paper we discuss the requirements for such a method with respect to the various aspects of feature importance and explore them using regression random forests and symbolic regression. We study 'conventional' feature selection with both methods on several test problems and a case study, compare the results, and identify the conceptual differences in the generated feature importances. We demonstrate that random forests may overlook important variables (those significantly related to the response) for various reasons, while symbolic regression identifies all important variables provided that models of sufficient quality are found. We explain these results by the fact that the variable importances produced by the two methods have different semantics.
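The claim that random forests may overlook important variables can be illustrated with a minimal sketch (not the paper's code, and the data set is synthetic): when two features are nearly identical copies of the single variable driving the response, impurity-based random-forest importance is split between them, so each one looks roughly half as important as the underlying variable truly is.

```python
# Illustrative sketch (assumed setup, not from the paper): impurity-based
# random-forest importances dilute the credit for a relevant variable
# across its near-duplicate, while an irrelevant variable stays near zero.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # near-duplicate of x1
x3 = rng.normal(size=n)               # irrelevant variable
y = x1 + 0.1 * rng.normal(size=n)     # response depends on x1 only

X = np.column_stack([x1, x2, x3])
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
imp = rf.feature_importances_

# imp[0] and imp[1] share the importance that belongs to x1 alone;
# imp[2] remains close to zero.
print(imp)
```

A selection rule that thresholds these importances could discard x1 or x2 even though both are significantly related to the response, which is one sense in which random-forest importances and model-based (e.g. symbolic-regression) importances carry different semantics.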