Random Forest for Gene Expression Based Cancer Classification: Overlooked Issues

Authors:
Oleg Okun; Helen Priisalu
Affiliations:
University of Oulu, Oulu 90014, Finland; Tallinn University of Technology, Tallinn 19086, Estonia
Venue:
IbPRIA '07 Proceedings of the 3rd Iberian conference on Pattern Recognition and Image Analysis, Part II
Year:
2007



Abstract

A random forest is an ensemble of decision trees and a popular technique in pattern recognition. In this article, we apply random forests to gene expression based cancer classification and address two issues that have so far been overlooked in other works. First, we demonstrate on two different real-world datasets that the performance of a random forest is strongly influenced by dataset complexity. When estimated before running the random forest, this complexity can serve as a useful performance indicator, and it can explain differences in performance across datasets. Second, we show that feature importance should be used with caution when ranking genes: two forests generated with different numbers of features per node split may have very similar classification errors on the same dataset, yet the respective lists of genes ranked by feature importance can be only weakly correlated.
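
The second point can be illustrated with a minimal sketch of how agreement between two gene rankings might be quantified. The abstract does not say which correlation measure the authors used, so Spearman's rank correlation is an assumption here, and the importance scores below are invented purely for illustration (they are not results from the paper). The function names are hypothetical helpers, not part of any library.

```python
def ranks(values):
    """Return 1-based ranks (largest value gets rank 1); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(importance_a, importance_b):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(importance_a), ranks(importance_b))


# Hypothetical importance scores for the same five genes from two forests
# grown with different numbers of features per node split (made-up values).
importance_forest_a = [0.40, 0.25, 0.20, 0.10, 0.05]
importance_forest_b = [0.10, 0.30, 0.05, 0.35, 0.20]

rho = spearman(importance_forest_a, importance_forest_b)
print(f"Spearman correlation between the two gene rankings: {rho:.2f}")
```

A value of rho near 1 would mean the two forests rank the genes almost identically. A value near 0, or a negative one, would mean the rankings disagree even if the forests' classification errors are similar.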