Separating the wheat from the chaff: on feature selection and feature importance in regression random forests and symbolic regression

  • Authors:
  • Sean Stijven; Wouter Minnebo; Katya Vladislavleva

  • Affiliations:
  • University of Antwerp, Antwerp, Belgium; University of Antwerp, Antwerp, Belgium; University of Antwerp, Antwerp, Belgium

  • Venue:
  • Proceedings of the 13th annual conference companion on Genetic and evolutionary computation
  • Year:
  • 2011

Abstract

Feature selection in high-dimensional data sets is an open problem with no universally satisfactory method available. In this paper we discuss the requirements for such a method with respect to the various aspects of feature importance and explore them using regression random forests and symbolic regression. We study 'conventional' feature selection with both methods on several test problems and a case study, compare the results, and identify the conceptual differences in the feature importances they generate. We demonstrate that random forests may overlook important variables (those significantly related to the response) for various reasons, whereas symbolic regression identifies all important variables provided that models of sufficient quality are found. We explain the results by the fact that the variable importances obtained by these methods have different semantics.
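
As a rough illustration of what a "feature importance" can mean, the sketch below computes a permutation-based importance: one input column is shuffled and the resulting increase in model error is recorded. Random forests commonly report a measure of this kind, but this is not the authors' procedure. The toy data, the fixed `model` function standing in for a fitted forest or symbolic regression model, and the helper names `mse` and `permutation_importance` are all hypothetical.

```python
import random


def mse(model, X, y):
    """Mean squared error of `model` over the rows of X."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each input column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving the other columns untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy problem: the response depends on x0 and x1; x2 is noise.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [3.0 * row[0] + row[1] ** 2 for row in X]


def model(row):
    # Stand-in for a fitted model; here it reproduces the true relation.
    return 3.0 * row[0] + row[1] ** 2


scores = permutation_importance(model, X, y)
print(scores)
```

In this sketch the unused input `x2` receives an importance of zero, while `x0` scores well above `x1`, even though both are genuinely related to the response. A ranking of this kind measures contribution to prediction error, which is one semantics among several and need not coincide with whether a variable appears in a good model.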