Stratified sampling for feature subspace selection in random forests for high dimensional data

Authors:
Yunming Ye;Qingyao Wu;Joshua Zhexue Huang;Michael K. Ng;Xutao Li
Affiliations:
Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, China and Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, China;Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, China and Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, China;Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China and Shenzhen Key Laboratory of High Performance Data Mining, China;Department of Mathematics, Hong Kong Baptist University, China;Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, China and Shenzhen Key Laboratory of Internet Information Collaboration, Shenzhen, China
Venue:
Pattern Recognition
Year:
2013



Abstract

For high dimensional data, a large portion of the features are often not informative of the class of the objects. Random forest algorithms tend to use simple random sampling of features when building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for random forests with high dimensional data. The key idea is to stratify the features into two groups: one group contains the strongly informative features and the other the weakly informative features. Then, for feature subspace selection, we randomly select features from each group proportionally. The advantage of stratified sampling is that it ensures each subspace contains enough informative features for classification in high dimensional data. Tests on synthetic data and on various real data sets in gene classification, image categorization and face recognition consistently demonstrate the effectiveness of the new method. Its performance is shown to be better than that of state-of-the-art algorithms, including SVM, four variants of random forests (RF, ERT, enrich-RF, and oblique-RF), and nearest neighbor (NN) algorithms.
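
The subspace selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how informativeness is measured or where the two groups are split, so the per-feature `scores` list, the mean-score threshold, and the function name `stratified_subspace` are all assumptions made for illustration only.

```python
import random

def stratified_subspace(scores, subspace_size, rng):
    """Draw feature indices from a strong and a weak group, proportionally.

    `scores` is a hypothetical per-feature informativeness score; how such
    a score is computed is not specified in the abstract.
    """
    # Assumption: features scoring above the mean form the "strong" group.
    mean = sum(scores) / len(scores)
    strong = [j for j, s in enumerate(scores) if s > mean]
    weak = [j for j, s in enumerate(scores) if s <= mean]

    # Proportional allocation, with at least one strong feature if any exist.
    n_strong = round(subspace_size * len(strong) / len(scores))
    if strong:
        n_strong = max(1, n_strong)
    n_strong = min(n_strong, len(strong), subspace_size)
    n_weak = min(subspace_size - n_strong, len(weak))

    # Sample without replacement inside each group.
    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)

# Toy data: 10 informative features among 100.
scores = [1.0] * 10 + [0.0] * 90
subspace = stratified_subspace(scores, 10, random.Random(0))
```

With simple random sampling, a 10-feature subspace drawn from this toy data would contain no informative feature roughly a third of the time; the stratified draw always includes at least one.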