Scalable random forests for massive data

  • Authors:
  • Bingguo Li; Xiaojun Chen; Mark Junjie Li; Joshua Zhexue Huang; Shengzhong Feng

  • Affiliations:
  • Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China (all authors)

  • Venue:
  • PAKDD'12: Proceedings of the 16th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Volume Part I
  • Year:
  • 2012

Abstract

This paper proposes SRF, a scalable random forest algorithm implemented in MapReduce. Decision trees for the random forest model are grown breadth-first: at each level of the trees, a pair of map and reduce functions splits the frontier nodes. Each mapper is dispatched to a local machine, where it computes local histograms of the nodes' subspace features from a data block. The local histograms are sent to reducers, which aggregate them into global histograms; the best split conditions of the nodes are computed from these global histograms and forwarded to the controller on the master machine to update the random forest model. A complete random forest is thus built with a sequence of map and reduce jobs. Experiments on large synthetic data show that SRF scales with both the number of trees and the number of examples: it built a random forest of 100 trees in a little over one hour from a 110 GB dataset with 1,000 features and 10 million records.
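The abstract describes the histogram-based split search but gives no code. Below is a minimal Python sketch of one level of the breadth-first growth, under several assumptions not taken from the paper: fixed-width histograms over feature values scaled to [0, 1), Gini impurity as the split criterion, and illustrative names (map_block, reduce_node) rather than the authors' actual API. In a real MapReduce deployment the emitted pairs would be shuffled by key so each reducer receives all local histograms for one node; here that grouping is assumed to have already happened.

    from collections import defaultdict

    N_BINS = 10  # fixed-width histogram bins per feature (assumed)

    def map_block(block, node_of, subspace_of):
        """Mapper: build local class histograms keyed by (node, feature, bin).

        block       -- list of (features: list[float], label, row_id) on this machine
        node_of     -- maps row_id to the frontier node the record falls in
        subspace_of -- maps a node to its randomly sampled feature indices
        """
        local = defaultdict(lambda: defaultdict(int))
        for features, label, row_id in block:
            node = node_of[row_id]
            for f in subspace_of[node]:
                b = min(int(features[f] * N_BINS), N_BINS - 1)  # values assumed in [0, 1)
                local[(node, f, b)][label] += 1
        # Emit (key, histogram) pairs, as a MapReduce mapper would.
        return list(local.items())

    def reduce_node(node, pairs):
        """Reducer: merge local histograms for one node and pick its best split."""
        # Aggregate into a global histogram: (feature, bin) -> class counts
        hist = defaultdict(lambda: defaultdict(int))
        for (_, f, b), counts in pairs:
            for label, n in counts.items():
                hist[(f, b)][label] += n

        def gini(counts):
            total = sum(counts.values())
            return 1.0 - sum((n / total) ** 2 for n in counts.values()) if total else 0.0

        best = None  # (weighted impurity, feature, threshold)
        for f in {f for f, _ in hist}:
            for split_bin in range(1, N_BINS):
                left, right = defaultdict(int), defaultdict(int)
                for b in range(N_BINS):
                    side = left if b < split_bin else right
                    for label, n in hist.get((f, b), {}).items():
                        side[label] += n
                n_l, n_r = sum(left.values()), sum(right.values())
                if n_l == 0 or n_r == 0:
                    continue
                score = (n_l * gini(left) + n_r * gini(right)) / (n_l + n_r)
                if best is None or score < best[0]:
                    best = (score, f, split_bin / N_BINS)
        # The controller on the master machine would use this result to
        # update the random forest model before the next level's job.
        return node, best

Because mappers only emit fixed-size histograms rather than raw records, the communication cost per level depends on the number of frontier nodes and features, not on the number of training examples, which is what makes the level-by-level scheme scale to data too large for any single machine.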