PLANET: massively parallel learning of tree ensembles with MapReduce

Authors: Biswanath Panda, Joshua S. Herbach, Sugato Basu, Roberto J. Bayardo
Affiliations: Google, Inc. (all authors)
Venue: Proceedings of the VLDB Endowment
Year: 2009

Abstract

Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state-of-the-art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google's computing infrastructure is based on commodity hardware. In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real-world learning task from the domain of computational advertising.
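To make the idea of "tree learning as a series of distributed computations" concrete, the following is a minimal single-process sketch of how one node split of a regression tree might be phrased as a map step and a reduce step. It is not the PLANET implementation: the function names (`map_shard`, `reduce_stats`, `best_split`), the `TOTAL` key, the fixed list of candidate thresholds, and the choice of squared-error reduction as the split criterion are all assumptions made for illustration. The point it tries to show is that each mapper only needs to emit small per-split summaries (count, sum, sum of squares), so no machine has to hold the full training set in memory.

```python
from collections import defaultdict

TOTAL = "total"  # hypothetical key: statistics over all rows reaching the node


def map_shard(shard, candidate_splits):
    """Map step (sketch): summarize one data shard.

    For every candidate split (feature, threshold), accumulate
    [count, sum(y), sum(y^2)] over the rows that would go LEFT,
    i.e. rows with features[feature] < threshold.
    """
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for features, y in shard:
        keys = [TOTAL] + [s for s in candidate_splits if features[s[0]] < s[1]]
        for key in keys:
            acc = stats[key]
            acc[0] += 1
            acc[1] += y
            acc[2] += y * y
    return list(stats.items())


def reduce_stats(mapped_outputs):
    """Reduce step (sketch): add up the partial statistics per key."""
    merged = defaultdict(lambda: [0, 0.0, 0.0])
    for output in mapped_outputs:
        for key, (n, s, ss) in output:
            acc = merged[key]
            acc[0] += n
            acc[1] += s
            acc[2] += ss
    return dict(merged)


def sse(n, s, ss):
    """Sum of squared errors around the mean, from sufficient statistics."""
    return ss - s * s / n if n > 0 else 0.0


def best_split(merged, candidate_splits):
    """Pick the split with the largest reduction in squared error.

    Right-branch statistics are derived as (total - left), so mappers
    never have to emit them.
    """
    n_t, s_t, ss_t = merged[TOTAL]
    best, best_gain = None, 0.0
    for split in candidate_splits:
        n_l, s_l, ss_l = merged.get(split, (0, 0.0, 0.0))
        n_r = n_t - n_l
        if n_l == 0 or n_r == 0:
            continue  # split does not separate the data
        gain = (sse(n_t, s_t, ss_t)
                - sse(n_l, s_l, ss_l)
                - sse(n_r, s_t - s_l, ss_t - ss_l))
        if gain > best_gain:
            best, best_gain = split, gain
    return best, best_gain


# Toy usage: two "shards" standing in for data held on different machines.
shards = [
    [({"x": 1.0}, 1.0), ({"x": 2.0}, 1.0)],
    [({"x": 8.0}, 5.0), ({"x": 9.0}, 5.0)],
]
splits = [("x", 1.5), ("x", 5.0), ("x", 8.5)]
merged = reduce_stats(map_shard(shard, splits) for shard in shards)
print(best_split(merged, splits))
```

In a real MapReduce deployment the map calls would run on separate workers and the framework would group the emitted pairs by key before reduction; here a generator expression stands in for both. Growing a full tree or an ensemble would repeat this step for each node that still needs expanding, which is the sense in which the abstract speaks of a series of distributed computations.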