Optimal Bayesian 2D-discretization for variable ranking in regression

  • Authors:
  • Marc Boullé; Carine Hue

  • Affiliations:
  • France Télécom R&D Lannion; France Télécom R&D Lannion

  • Venue:
  • DS'06 Proceedings of the 9th international conference on Discovery Science
  • Year:
  • 2006

Abstract

In supervised machine learning, variable ranking aims at sorting the input variables according to their relevance with respect to an output variable. In this paper, we propose a new relevance criterion for variable ranking in regression problems with a large number of variables. This criterion is based on a discretization of both the input and output variables, derived as an extension of a Bayesian nonparametric discretization method for the classification case. To this end, we introduce a family of discretization grid models and a prior distribution defined on this model space. For this prior, we then derive the exact Bayesian model selection criterion. The resulting most probable grid partition of the data emphasizes the relation (or the absence of relation) between inputs and output and provides a ranking criterion for the input variables. Preliminary experiments on both synthetic and real data demonstrate the criterion's capacity to select the most relevant variables and to improve a regression tree.
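
To make the idea of ranking variables through a two-dimensional grid concrete, here is a minimal sketch in Python. It is not the criterion of the paper: the abstract does not state the prior or the exact Bayesian model selection formula, so this sketch swaps in a fixed equal-frequency grid and scores each input by the mutual information of the grid's cell counts. The function names (`equal_frequency_bins`, `grid_mutual_information`, `rank_variables`) and the default grid size of 4 by 4 are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Assign each value an interval index in 0..n_bins-1 from its rank.

    Assumption: values are continuous, so ties are rare. Tied values are
    split by their position in the list, which is a simplification.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def grid_mutual_information(x, y, n_bins_x=4, n_bins_y=4):
    """Mutual information (in nats) between the discretized x and y.

    Stand-in score: the paper selects the grid by Bayesian model
    selection, whereas this sketch uses a fixed grid size.
    """
    bx = equal_frequency_bins(x, n_bins_x)
    by = equal_frequency_bins(y, n_bins_y)
    n = len(x)
    cell = Counter(zip(bx, by))  # instances per grid cell
    row = Counter(bx)            # instances per input interval
    col = Counter(by)            # instances per output interval
    return sum(
        (c / n) * math.log(c * n / (row[i] * col[j]))
        for (i, j), c in cell.items()
    )


def rank_variables(columns, y, **grid_kwargs):
    """Return (name, score) pairs sorted from most to least relevant."""
    scores = {
        name: grid_mutual_information(x, y, **grid_kwargs)
        for name, x in columns.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical synthetic data: one informative input, one pure-noise input.
    random.seed(0)
    n = 400
    x_rel = [random.random() for _ in range(n)]
    x_noise = [random.random() for _ in range(n)]
    y = [math.sin(3.0 * v) + random.gauss(0.0, 0.05) for v in x_rel]
    for name, score in rank_variables({"x_rel": x_rel, "x_noise": x_noise}, y):
        print(f"{name}: {score:.3f}")
```

A fixed grid keeps the sketch short, but it cannot detect the absence of a relation the way a selected grid can: with a Bayesian criterion, an irrelevant input would ideally lead to a single-cell grid, whereas here it simply receives a small positive score.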