Classifying Very High-Dimensional Data with Random Forests Built from Small Subspaces

Authors:
Baoxun Xu;Joshua Zhexue Huang;Graham Williams;Qiang Wang;Yunming Ye
Affiliations:
Harbin Institute of Technology Shenzhen Graduate School, China;Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China;Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China;Harbin Institute of Technology Shenzhen Graduate School, China;Harbin Institute of Technology Shenzhen Graduate School, China
Venue:
International Journal of Data Warehousing and Mining
Year:
2012

Abstract

The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for each subspace is not suitable for high-dimensional data consisting of thousands of features, because such data often contain many features that are uninformative for classification, and random sampling often fails to include informative features in the selected subspaces. Consequently, the classification performance of the random forest model suffers significantly. In this paper, the authors propose an improved random forest method that uses a novel feature weighting method for subspace selection and thereby enhances classification performance on high-dimensional data. A series of experiments on 9 real-life high-dimensional datasets demonstrated that, using a subspace size of ⌊log2(M) + 1⌋ features, where M is the total number of features in the dataset, the proposed random forest model significantly outperforms existing random forest models.
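
The contrast between uniform and weighted subspace sampling can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the example weights, and the assumed subspace size of ⌊log2(M) + 1⌋ are all illustrative assumptions, and the abstract does not say how the feature weights are computed. The sketch shows how likely a uniformly sampled small subspace is to miss every informative feature, and how a weighted draw without replacement shifts the selection toward highly weighted features.

```python
import math
import random


def subspace_size(num_features):
    """Assumed small subspace size: floor(log2(M) + 1) for M features."""
    return int(math.floor(math.log2(num_features) + 1))


def prob_no_informative(num_features, num_informative, size):
    """Probability that a uniform sample of `size` distinct features
    contains none of the `num_informative` informative ones."""
    return (math.comb(num_features - num_informative, size)
            / math.comb(num_features, size))


def weighted_subspace(weights, size, rng):
    """Draw `size` distinct feature indices, each pick made with probability
    proportional to its (hypothetical, strictly positive) weight."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(size):
        pool = [weights[i] for i in remaining]
        pick = rng.choices(remaining, weights=pool, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)  # sample without replacement
    return chosen


# Hypothetical setting: 10,000 features, of which 100 are informative.
M = 10_000
m = subspace_size(M)
miss = prob_no_informative(M, 100, m)

# Hypothetical weights: the 100 informative features (indices 0..99)
# are weighted 50 times higher than the rest.
weights = [50.0] * 100 + [1.0] * (M - 100)
subspace = weighted_subspace(weights, m, random.Random(0))
```

In this hypothetical setting the subspace holds 14 features, and a uniform draw misses all 100 informative features with probability of roughly 0.87, which is the failure mode the abstract describes. Weighting the draw makes such empty subspaces much less likely while still leaving every feature a nonzero chance of selection.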