This paper proposes SRF, a scalable random forest algorithm implemented with MapReduce. The decision trees of the forest are grown breadth-first: at each tree level, a pair of map and reduce functions splits the frontier nodes. Each mapper runs on a local machine and computes, from one data block, the local histograms of the subspace features of those nodes. The local histograms are sent to reducers, which merge them into global histograms, compute the best split condition for each node, and forward the splits to the controller on the master machine to update the random forest model. A complete random forest is thus built by a sequence of map and reduce rounds. Experiments on large synthetic data show that SRF scales with both the number of trees and the number of examples: it builds a random forest of 100 trees in a little more than one hour from 110 GB of data containing 10 million records with 1,000 features.
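The following is a minimal, self-contained sketch of the per-level histogram map/reduce step described above. It is written in plain Python rather than against the Hadoop API, and it assumes details the abstract does not specify: fixed-width 10-bin histograms over features scaled to [0, 1), binary labels, and Gini impurity as the split criterion. All identifiers (map_local_histograms, reduce_best_splits, etc.) are illustrative, not the authors' implementation.

```python
import random
from collections import defaultdict

NUM_BINS = 10  # assumed histogram resolution; feature values scaled to [0, 1)

def map_local_histograms(block, frontier, subspace):
    # Mapper: from one data block, count class labels per (node, feature, bin)
    # for the current frontier nodes' subspace features.
    hist = defaultdict(lambda: [0, 0])
    for node, x, y in block:          # record = (node id, feature vector, 0/1 label)
        if node in frontier:
            for f in subspace[node]:
                b = min(int(x[f] * NUM_BINS), NUM_BINS - 1)
                hist[(node, f, b)][y] += 1
    return list(hist.items())

def gini(c0, c1):
    # Binary Gini impurity from class counts: 1 - p0^2 - p1^2 = 2p(1-p).
    n = c0 + c1
    if n == 0:
        return 0.0
    p = c0 / n
    return 2.0 * p * (1.0 - p)

def reduce_best_splits(emitted):
    # Reducer: merge local histograms into global ones, then pick, per node,
    # the (feature, threshold) minimizing the weighted Gini impurity.
    by_nf = defaultdict(lambda: [[0, 0] for _ in range(NUM_BINS)])
    for (node, f, b), (c0, c1) in emitted:
        by_nf[(node, f)][b][0] += c0
        by_nf[(node, f)][b][1] += c1
    best = {}
    for (node, f), bins in by_nf.items():
        tot0 = sum(b[0] for b in bins)
        tot1 = sum(b[1] for b in bins)
        l0 = l1 = 0
        for t in range(NUM_BINS - 1):  # candidate threshold after bin t
            l0 += bins[t][0]
            l1 += bins[t][1]
            r0, r1 = tot0 - l0, tot1 - l1
            score = ((l0 + l1) * gini(l0, l1)
                     + (r0 + r1) * gini(r0, r1)) / (tot0 + tot1)
            if node not in best or score < best[node][0]:
                best[node] = (score, f, (t + 1) / NUM_BINS)
    return {node: (f, thr) for node, (_, f, thr) in best.items()}

# Tiny demo: two "blocks" of synthetic records, all at frontier node 0,
# whose label depends on feature 2 exceeding 0.6.
random.seed(0)
def make_block(n):
    block = []
    for _ in range(n):
        x = [random.random() for _ in range(5)]
        block.append((0, x, int(x[2] > 0.6)))
    return block

subspace = {0: [1, 2, 3]}             # random feature subspace for node 0
emits = [e for blk in (make_block(500), make_block(500))
         for e in map_local_histograms(blk, {0}, subspace)]
print(reduce_best_splits(emits))      # expect {0: (2, 0.6)}
```

In the setting the abstract describes, these two functions would presumably run as Hadoop map and reduce tasks over HDFS data blocks, with the controller on the master applying the returned splits to the forest model and launching the round for the next tree level.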