A new variable importance measure for random forests with missing data

Authors:
Alexander Hapfelmeier;Torsten Hothorn;Kurt Ulm;Carolin Strobl
Affiliations:
Institut für Medizinische Statistik und Epidemiologie, Technische Universität München, München, Germany 81675;Institut für Statistik, Ludwig-Maximilians-Universität, München, Germany 80539;Institut für Medizinische Statistik und Epidemiologie, Technische Universität München, München, Germany 81675;Department of Psychology, University of Zurich, Zurich, Switzerland 8050
Venue:
Statistics and Computing
Year:
2014

Abstract

Random forests are widely used in many research fields for prediction and interpretation purposes. Their popularity is rooted in several appealing characteristics, such as their ability to deal with high-dimensional data, complex interactions and correlations between variables. Another important feature is that random forests provide variable importance measures that can be used to identify the most important predictor variables. Although workarounds such as complete case analysis and imputation exist, existing methods for computing such measures cannot be applied in a straightforward way when the data contain missing values. This paper presents a solution to this pitfall by introducing a new variable importance measure that is applicable to any kind of data, whether or not it contains missing values. An extensive simulation study shows that the new measure meets sensible requirements and has good variable ranking properties. An application to two real data sets also indicates that the new approach may provide a more sensible variable ranking than the widespread complete case analysis. Because the new measure takes the occurrence of missing values into account, its results also differ from those obtained under multiple imputation.
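
To make the problem concrete, the following minimal sketch illustrates the classical permutation idea behind variable importance, together with the complete case workaround mentioned above. It is not the new measure introduced in the paper. The toy data, the fixed rule `predict` (standing in for a fitted forest) and all function names are assumptions made purely for illustration.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of observations whose prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, col, rng):
    """Drop in accuracy after randomly shuffling the values of column `col`."""
    baseline = accuracy(predict, rows, labels)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)

def complete_cases(rows, labels):
    """Keep only observations without any missing value (encoded as None)."""
    kept = [(r, y) for r, y in zip(rows, labels) if None not in r]
    return [r for r, _ in kept], [y for _, y in kept]

# Hypothetical toy data: x0 determines the label, x1 is unrelated noise.
n = 100
labels = [i / n > 0.5 for i in range(n)]
rows = [(i / n, (i * 7) % 10 / 10) for i in range(n)]
# Every fifth observation has a missing value in x0.
rows = [(None, r[1]) if i % 5 == 0 else r for i, r in enumerate(rows)]

def predict(row):
    """Stand-in for a fitted model; comparing None with 0.5 would raise TypeError."""
    return row[0] > 0.5

# The permutation scheme cannot be run on `rows` directly, because `predict`
# fails on missing values. Complete case analysis discards those observations.
cc_rows, cc_labels = complete_cases(rows, labels)

rng = random.Random(0)
importance_x0 = permutation_importance(predict, cc_rows, cc_labels, 0, rng)
importance_x1 = permutation_importance(predict, cc_rows, cc_labels, 1, rng)
```

In this sketch the importance of `x1` is exactly zero because the stand-in rule never uses it, while shuffling `x0` lowers the accuracy. The cost of the workaround is that a fifth of the observations is discarded before the importance is computed, which is the kind of information loss that motivates a measure applicable to incomplete data.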