Scalable regression tree learning on Hadoop using OpenPlanet

Authors:
Wei Yin;Yogesh Simmhan;Viktor K. Prasanna
Affiliations:
University of Southern California, Los Angeles, CA, USA;University of Southern California, Los Angeles, CA, USA;University of Southern California, Los Angeles, CA, USA
Venue:
Proceedings of third international workshop on MapReduce and its Applications Date
Year:
2012

Abstract

As scientific and engineering domains attempt to analyze the deluge of data from sensors and instruments, machine learning is becoming a key data mining tool for building prediction models. The regression tree is a popular learning model that combines decision trees and linear regression to forecast numerical target variables from a set of input features. MapReduce is well suited to such data-intensive learning applications, and a proprietary regression tree algorithm, PLANET, was previously proposed on MapReduce. In this paper, we describe OpenPlanet, an open source implementation of this algorithm on the Hadoop framework using a hybrid approach. We evaluate the performance of OpenPlanet using real-world datasets from the Smart Power Grid domain for energy use forecasting, and propose strategies for tuning Hadoop parameters that improve performance over the default configuration by 75% for a training dataset of 17 million tuples on a 64-core Hadoop cluster on FutureGrid.
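To make the idea of learning a regression tree with MapReduce more concrete, the following is a minimal, single-process Python sketch of how one split of a tree node might be chosen from per-partition statistics. It is an illustration under stated assumptions, not the OpenPlanet or PLANET code: the function names (`map_partition`, `reduce_stats`, `best_split`), the single numeric feature, the fixed list of candidate thresholds, and the use of squared error as the split criterion are all assumptions made for this example, and no Hadoop API is involved.

```python
from collections import defaultdict

# Illustrative sketch only: names and split criterion are assumptions,
# not taken from OpenPlanet or PLANET.

def map_partition(rows, thresholds):
    """Map step: for one data partition, accumulate [count, sum_y, sum_y^2]
    for each (threshold, side) pair. rows is a list of (x, y) tuples."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for x, y in rows:
        for t in thresholds:
            side = "left" if x <= t else "right"
            s = stats[(t, side)]
            s[0] += 1
            s[1] += y
            s[2] += y * y
    return stats

def reduce_stats(partials):
    """Reduce step: add up the per-partition statistics key by key."""
    total = defaultdict(lambda: [0, 0.0, 0.0])
    for stats in partials:
        for key, (n, sum_y, sum_y2) in stats.items():
            acc = total[key]
            acc[0] += n
            acc[1] += sum_y
            acc[2] += sum_y2
    return total

def sse(n, sum_y, sum_y2):
    """Sum of squared errors around the mean, from aggregated statistics."""
    return sum_y2 - (sum_y * sum_y) / n if n else 0.0

def best_split(total, thresholds):
    """Pick the threshold with the lowest combined left + right squared error."""
    best = None
    for t in thresholds:
        left = total.get((t, "left"), [0, 0.0, 0.0])
        right = total.get((t, "right"), [0, 0.0, 0.0])
        if left[0] == 0 or right[0] == 0:
            continue  # a split must put data on both sides
        cost = sse(*left) + sse(*right)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

# Toy data split into two partitions, as if held by two mappers.
partitions = [
    [(1.0, 10.0), (2.0, 11.0)],
    [(8.0, 50.0), (9.0, 52.0)],
]
thresholds = [1.5, 5.0, 8.5]

partials = [map_partition(p, thresholds) for p in partitions]
threshold, cost = best_split(reduce_stats(partials), thresholds)
print(threshold, cost)
```

The point of the sketch is that each partition only needs to emit counts, sums, and sums of squares, so the work of scanning the data can be spread across mappers while the final comparison of candidate splits stays small.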