Random Forest for Gene Expression Based Cancer Classification: Overlooked Issues

  • Authors:
  • Oleg Okun; Helen Priisalu

  • Affiliations:
  • University of Oulu, Oulu 90014, Finland; Tallinn University of Technology, Tallinn 19086, Estonia

  • Venue:
  • IbPRIA '07: Proceedings of the 3rd Iberian Conference on Pattern Recognition and Image Analysis, Part II
  • Year:
  • 2007

Abstract

Random forest is an ensemble of decision trees and a popular ensemble technique in pattern recognition. In this article, we apply random forest to gene expression based cancer classification and address two issues that have so far been overlooked in other works. First, we demonstrate on two real-world datasets that the performance of random forest is strongly influenced by dataset complexity. When estimated before running random forest, this complexity can serve as a useful performance indicator and can explain differences in performance across datasets. Second, we show that feature importance should be used with caution when ranking genes: two forests generated with different numbers of features per node split may have very similar classification errors on the same dataset, yet the respective lists of genes ranked by feature importance can be only weakly correlated.
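
The abstract does not say how agreement between the two gene rankings is measured. As a minimal sketch, assuming Spearman rank correlation is a reasonable choice, the comparison of two importance vectors could look like the following. The function names and the importance values are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (largest) to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(a) != len(b):
        raise ValueError("importance vectors must have the same length")
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all ranks are tied")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical importance scores for the same five genes from two forests
# grown with different numbers of features per node split (made-up values).
forest_a_importance = [0.40, 0.25, 0.20, 0.10, 0.05]
forest_b_importance = [0.10, 0.30, 0.05, 0.35, 0.20]

rho = spearman(forest_a_importance, forest_b_importance)
print(f"Spearman correlation between the two gene rankings: {rho:.2f}")
```

A value of rho near 1 would mean the two forests rank the genes in nearly the same order. A value near 0 would mean the rankings are close to unrelated, even if the forests' classification errors are similar.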