Feature elimination approach based on random forest for cancer diagnosis

  • Authors:
  • Ha-Nam Nguyen;Trung-Nghia Vu;Syng-Yup Ohn;Young-Mee Park;Mi Young Han;Chul Woo Kim

  • Affiliations:
  • Dept. of Computer and Information Engineering, Hankuk Aviation University, Seoul, Korea;Dept. of Computer and Information Engineering, Hankuk Aviation University, Seoul, Korea;Dept. of Computer and Information Engineering, Hankuk Aviation University, Seoul, Korea;Dept. of Cell Stress Biology, Roswell Park Cancer Institute, SUNY Buffalo, NY;Bioinfra Inc., Seoul, Korea;Dept. of Pathology, Tumor Immunity Medical Research Center, Seoul National University College of Medicine, Seoul, Korea

  • Venue:
  • MICAI'06: Proceedings of the 5th Mexican International Conference on Artificial Intelligence
  • Year:
  • 2006

Abstract

The performance of learning tasks is very sensitive to the characteristics of the training data. Several techniques can improve learning performance, including standardization, normalization, signal enhancement, and linear or non-linear space embedding. Among these, determining the relevant and informative features is one of the key steps in the data analysis process: it helps to improve performance, reduce the amount of data generated, and understand the characteristics of the data. Researchers have developed various methods for extracting a set of relevant features, but no single method prevails. Random Forest, an ensemble classifier built from a set of tree classifiers, yields good classification performance. Taking advantage of Random Forest and using the wrapper approach first introduced by Kohavi et al., we propose a new algorithm to find the optimal subset of features. Random Forest is used to obtain feature ranking values, and these values determine which features are eliminated in each iteration of the algorithm. We conducted experiments with two public datasets: colon cancer and leukemia. Experimental results on these real-world data showed that the proposed method achieves a higher prediction rate than a baseline method on certain data sets, and performance comparable to, and sometimes better than, that of widely used feature selection methods.
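
The abstract describes an iterative loop in which feature ranking values decide which features are eliminated at each step. The sketch below illustrates that general idea only and is not the authors' exact procedure, which the abstract does not specify. The names `eliminate_features`, `rank_fn`, `score_fn`, `drop_per_step`, and `min_features` are hypothetical. In the paper's setting `rank_fn` would presumably return Random Forest importance values and `score_fn` a validated prediction rate; here both are replaced by toy stand-ins (`toy_rank`, `toy_score`) with made-up numbers so that the example runs with the standard library alone.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
Ranker = Callable[[List[int]], Dict[int, float]]  # active features -> importance per feature
Scorer = Callable[[List[int]], float]             # active features -> prediction rate


def eliminate_features(
    n_features: int,
    rank_fn: Ranker,
    score_fn: Scorer,
    drop_per_step: int = 1,
    min_features: int = 1,
) -> Tuple[List[int], float, List[Tuple[List[int], float]]]:
    """Backward elimination: repeatedly drop the lowest-ranked features
    and remember the subset with the best score seen so far."""
    active = list(range(n_features))
    best_subset, best_score = list(active), score_fn(active)
    history = [(list(active), best_score)]

    while len(active) > min_features:
        importances = rank_fn(active)
        n_drop = min(drop_per_step, len(active) - min_features)
        # Features with the smallest ranking values are eliminated first.
        weakest = sorted(active, key=lambda f: importances[f])[:n_drop]
        active = [f for f in active if f not in weakest]

        score = score_fn(active)
        history.append((list(active), score))
        if score >= best_score:  # on ties, prefer the smaller subset
            best_subset, best_score = list(active), score

    return best_subset, best_score, history


# Toy stand-ins with invented values; features 0 and 2 play the role of
# informative features, the rest act as noise.
TOY_IMPORTANCE = {0: 0.9, 1: 0.05, 2: 0.7, 3: 0.01, 4: 0.1}


def toy_rank(active: List[int]) -> Dict[int, float]:
    return {f: TOY_IMPORTANCE[f] for f in active}


def toy_score(active: List[int]) -> float:
    informative = sum(1 for f in active if f in (0, 2))
    noise = len(active) - informative
    return 0.5 + 0.2 * informative - 0.02 * noise


best_subset, best_score, history = eliminate_features(5, toy_rank, toy_score)
print(best_subset, best_score)
```

Passing the ranker and scorer as functions keeps the elimination loop independent of any particular classifier, so a Random Forest implementation could be substituted for the toy functions without changing the loop.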