Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics

Authors:
Anne-Laure Boulesteix;Silke Janitza;Jochen Kruppa;Inke R. König
Affiliations:
Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Ludwig-Maximilians-Universität München, München, Germany;Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie, Ludwig-Maximilians-Universität München, München, Germany;Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Lübeck, Germany;Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Lübeck, Germany
Venue:
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Year:
2012

Abstract

The random forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables, and returns measures of variable importance. This paper synthesizes 10 years of RF development with emphasis on applications to bioinformatics and computational biology. Special attention is paid to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics as well as some representative examples of RF applications in this context and possible directions for future research. © 2012 Wiley Periodicals, Inc.