Selection of relevant features and examples in machine learning. Artificial Intelligence - Special issue on relevance.
Wrappers for feature subset selection. Artificial Intelligence - Special issue on relevance.
Machine Learning.
SLIQ: A Fast Scalable Classifier for Data Mining. EDBT '96 Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology.
Feature selection for high-dimensional genomic microarray data. ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning.
On Feature Selection: Learning with Exponentially Many Irrelevant Features as Training Examples. ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning.
Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection. ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning.
CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics.
Feature Selection for Support Vector Machines by Means of Genetic Algorithms. ICTAI '03 Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence.
Pattern Classification (2nd Edition).
Feature selection for classifying high-dimensional numerical data. CVPR'04 Proceedings of the 2004 IEEE computer society conference on Computer vision and pattern recognition.
Combined kernel function approach in SVM for diagnosis of cancer. ICNC'05 Proceedings of the First international conference on Advances in Natural Computation - Volume Part I.
Computers in Biology and Medicine.
The performance of learning tasks is very sensitive to the characteristics of the training data. There are several ways to improve learning performance, including standardization, normalization, signal enhancement, and linear or non-linear space embedding methods. Among these, identifying the relevant and informative features is a key step in the data analysis process: it improves predictive performance, reduces the amount of data, and helps in understanding the characteristics of the data. Researchers have developed various methods to extract a set of relevant features, but no single method prevails. Random Forest, an ensemble classifier built from a set of tree classifiers, yields good classification performance. Taking advantage of Random Forest and using the wrapper approach first introduced by Kohavi and John, we propose a new algorithm to find an optimal subset of features. The Random Forest is used to obtain feature ranking values, and these values determine which features are eliminated in each iteration of the algorithm. We conducted experiments with two public datasets: colon cancer and leukemia cancer. The experimental results on this real-world data showed that the proposed method achieves a higher prediction rate than a baseline method on certain datasets, and comparable, sometimes better, performance than widely used feature selection methods.
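The procedure described above, ranking features with a Random Forest and iteratively eliminating the lowest-ranked ones while a wrapper-style evaluation tracks the best subset, can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm; the function name, the fraction of features dropped per iteration, and the use of scikit-learn with cross-validated accuracy as the wrapper criterion are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rf_feature_elimination(X, y, drop_frac=0.2, min_features=2, random_state=0):
    """Iteratively drop the lowest-ranked features by Random Forest
    importance, returning the subset with the best CV accuracy.
    (Illustrative sketch; drop_frac and the CV criterion are assumed.)"""
    remaining = np.arange(X.shape[1])          # indices of surviving features
    best_score, best_subset = -np.inf, remaining.copy()
    while len(remaining) >= min_features:
        rf = RandomForestClassifier(n_estimators=100, random_state=random_state)
        # Wrapper step: evaluate the current subset by cross-validation.
        score = cross_val_score(rf, X[:, remaining], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, remaining.copy()
        # Ranking step: fit on the subset and rank by feature importance.
        rf.fit(X[:, remaining], y)
        order = np.argsort(rf.feature_importances_)  # ascending importance
        n_drop = max(1, int(drop_frac * len(remaining)))
        remaining = remaining[order[n_drop:]]        # eliminate the weakest
    return best_subset, best_score

# Usage on synthetic data (stand-in for the microarray datasets):
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)
subset, score = rf_feature_elimination(X, y)
```

The wrapper evaluation is what distinguishes this from plain importance thresholding: each candidate subset is scored by the classifier itself, and the best-scoring subset is returned rather than the last one produced.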