Relevance measures for subset variable selection in regression problems based on k-additive mutual information

  • Authors: Ivan Kojadinovic

  • Affiliations: LINA CNRS FRE 2729, Site école polytechnique de l'université de Nantes, Rue Christian Pauc, 44306 Nantes, France

  • Venue: Computational Statistics & Data Analysis
  • Year: 2005

Abstract

In the framework of subset variable selection for regression, relevance measures based on mutual information are studied. Results on the estimation of this index of stochastic dependence in a continuous setting are presented first. They rely on kernel density estimation, which makes the overall cost of estimating the mutual information quadratic in the sample size. The behavior of the mutual information as a relevance measure is then studied empirically on several regression problems. These problems are artificially generated to contain irrelevant and redundant candidate explanatory variables as well as strongly nonlinear relationships. Next, still in a subset variable selection context, computationally more efficient approximations of the mutual information based on the notion of k-additive truncation are proposed. The 2- and 3-additive truncations appear to be of practical interest as relevance measures. The 2-additive truncation approximates the relevance of a set of potential predictors from the relevance values of the singletons and pairs it contains. The 3-additive truncation additionally involves the relevance values of the 3-element subsets. The lower the amount of redundancy among the candidate explanatory variables, the better these approximations are. Finally, the sample behavior of the two resulting relevance measures is studied empirically on the previously generated nonlinear artificial regression problems.
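
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The function names kde_mutual_information and two_additive_relevance are hypothetical, the Gaussian kernel with SciPy's default bandwidth is an assumption, and the 2-additive formula shown is the standard Möbius-based truncation of a set function, which is assumed here to match the one the paper uses.

```python
from itertools import combinations

import numpy as np
from scipy.stats import gaussian_kde


def kde_mutual_information(x, y):
    """Resubstitution estimate of I(X; Y) in nats from paired samples.

    x: array of shape (n,) or (n, d) holding candidate predictors.
    y: array of shape (n,) holding the response.
    Assumption: Gaussian kernels with SciPy's default (Scott) bandwidth.
    Exactly collinear predictors make the sample covariance singular,
    in which case gaussian_kde raises an error.
    """
    y = np.asarray(y, dtype=float).reshape(1, -1)              # shape (1, n)
    x = np.asarray(x, dtype=float).reshape(y.shape[1], -1).T   # shape (d, n)
    xy = np.vstack([x, y])                                     # shape (d + 1, n)
    # Each density is a sum of n kernels evaluated at the n sample points,
    # which is where the quadratic cost in the sample size comes from.
    f_x = gaussian_kde(x)(x)
    f_y = gaussian_kde(y)(y)
    f_xy = gaussian_kde(xy)(xy)
    return float(np.mean(np.log(f_xy / (f_x * f_y))))


def two_additive_relevance(relevance, subset):
    """Approximate the relevance of `subset` from singletons and pairs only.

    relevance: callable mapping a tuple of variable indices to a number,
    for example a wrapper around kde_mutual_information.
    """
    subset = tuple(sorted(subset))
    singles = {i: relevance((i,)) for i in subset}
    total = sum(singles.values())
    for i, j in combinations(subset, 2):
        # Pair interaction term: joint relevance minus the two singletons.
        total += relevance((i, j)) - singles[i] - singles[j]
    return total
```

With p candidate variables, this truncation needs only the p singleton and p(p-1)/2 pair relevance values, which can be computed once and reused for every subset examined during the search. A 3-additive version would add the analogous interaction terms for the 3-element subsets.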