Microarray studies yield data sets consisting of a large number of candidate predictors (genes) on a small number of observations (samples). When interest lies in predicting phenotypic class using gene expression data, the goals are often both to produce an accurate classifier and to uncover the predictive structure of the problem. Many machine learning methods, such as k-nearest neighbors, support vector machines, and neural networks, are useful for classification. However, these methods provide no insight into which covariates contribute most to the predictive structure. Other methods, such as linear discriminant analysis, require that the predictor space be substantially reduced before the classifier is derived. A more recently developed method, random forests (RF), does not require reduction of the predictor space prior to classification. Additionally, RF yields a variable importance measure for each candidate predictor. This study examined the effectiveness of RF variable importance measures in identifying the true predictor among a large number of candidate predictors. An extensive simulation study was conducted using 20 levels of correlation among the predictor variables and 7 levels of association between the true predictor and the dichotomous response. We conclude that the RF methodology is attractive for use in classification problems when the goals of the study are both to produce an accurate classifier and to provide insight into the discriminative ability of individual predictor variables. Such goals are common in microarray studies, and we therefore demonstrate the use of RF variable importance measures on a microarray data set.
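The setup described above can be sketched in a few lines of Python. This is an illustrative example only, assuming scikit-learn's `RandomForestClassifier`; the simulation design (one informative predictor hidden among correlated noise predictors driving a dichotomous response) mirrors the spirit of the study, but the sample size, number of predictors, and signal strength used here are arbitrary choices, not the settings from the paper's simulation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, p = 100, 20  # few samples, many candidate predictors (microarray-like)

# Candidate predictors: column 0 is the "true" predictor, the rest are noise.
X = rng.normal(size=(n, p))

# Dichotomous response associated with the true predictor plus noise.
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Fit a random forest; no prior reduction of the predictor space is needed.
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X, y)

# Impurity-based variable importance, one value per candidate predictor.
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]
print("top-ranked predictor:", ranking[0])
```

With a signal this strong, the true predictor (column 0) should rank first. Note that scikit-learn's `feature_importances_` is the impurity-based measure; permutation importance (available via `sklearn.inspection.permutation_importance`) is closer to the classical RF importance measure and is less biased when predictors are correlated.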