Analysis and correction of bias in Total Decrease in Node Impurity measures for tree-based algorithms

Authors:
Marco Sandri;Paola Zuccolotto
Affiliations:
Department of Quantitative Methods, University of Brescia, Brescia, Italy 25122;Department of Quantitative Methods, University of Brescia, Brescia, Italy 25122
Venue:
Statistics and Computing
Year:
2010

Abstract

Variable selection is one of the main problems faced by data mining and machine learning techniques. These techniques are often based, more or less explicitly, on some measure of variable importance. This paper considers Total Decrease in Node Impurity (TDNI) measures, a popular class of variable importance measures defined for decision trees and tree-based ensemble methods such as Random Forests and Gradient Boosting Machines. Despite their wide use, some measures in this class are known to be biased, and some correction strategies have been proposed. The aim of this paper is twofold. First, it investigates the source and the characteristics of bias in TDNI measures using the notions of informative and uninformative splits. Second, it extends a bias-correction algorithm, recently proposed for the Gini measure in the context of classification, to the entire class of TDNI measures and investigates its performance in the regression framework using simulated and real data.
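
To make the quantity concrete, the following is a minimal sketch of a TDNI-style score for a single regression tree. It is not the authors' implementation. The impurity is assumed here to be the within-node sum of squared errors, since the abstract does not fix an impurity function. The helper names (`sse`, `best_split`, `tdni`, `demo`) and the stopping parameters (`max_depth`, `min_size`) are hypothetical. The score of a variable is the sum of the impurity decreases over all nodes split on that variable.

```python
import random


def sse(ys):
    """Within-node sum of squared errors around the node mean (assumed impurity)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((v - mean) ** 2 for v in ys)


def best_split(X, y, rows, min_size):
    """Return the best binary split as (decrease, variable, left, right), or None."""
    parent = sse([y[i] for i in rows])
    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in rows})
        # Candidate cut points are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [i for i in rows if X[i][j] <= cut]
            right = [i for i in rows if X[i][j] > cut]
            if len(left) < min_size or len(right) < min_size:
                continue
            decrease = parent - sse([y[i] for i in left]) - sse([y[i] for i in right])
            if best is None or decrease > best[0]:
                best = (decrease, j, left, right)
    return best


def tdni(X, y, max_depth=3, min_size=5):
    """Sum, for each variable, the impurity decreases of all splits made on it."""
    scores = [0.0] * len(X[0])

    def grow(rows, depth):
        if depth == 0:
            return
        split = best_split(X, y, rows, min_size)
        if split is None or split[0] <= 0:
            return
        decrease, j, left, right = split
        scores[j] += decrease
        grow(left, depth - 1)
        grow(right, depth - 1)

    grow(list(range(len(y))), max_depth)
    return scores


def demo(seed=0, n=100):
    """Score two predictors that are both unrelated to the response."""
    rng = random.Random(seed)
    # Column 0 is binary noise, column 1 is continuous noise.
    X = [[rng.randint(0, 1), rng.random()] for _ in range(n)]
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return tdni(X, y)


if __name__ == "__main__":
    print(demo())
```

In `demo`, neither predictor carries information about the response, so an unbiased importance measure would score them alike. In runs of this kind the continuous predictor typically receives the larger score, because it offers many more candidate cut points. This is one commonly reported form of the bias and is shown here only as an illustration, not as the paper's analysis.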