Towards optimizing Hadoop provisioning in the cloud

Authors:
Karthik Kambatla;Abhinav Pathak;Himabindu Pucha
Affiliations:
Purdue University;Purdue University;IBM Research Almaden
Venue:
HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
Year:
2009

Abstract

Data analytics is becoming increasingly prominent in a variety of application areas, ranging from extracting business intelligence to processing data from scientific studies. The MapReduce programming paradigm lends itself well to these data-intensive analytics jobs, given its ability to scale out and leverage several machines to process data in parallel. In this work we argue that such MapReduce-based analytics are particularly synergistic with the pay-as-you-go model of a cloud platform. However, a key challenge facing end-users in this environment is the ability to provision MapReduce applications to minimize the incurred cost while obtaining the best performance. This paper first motivates the importance of optimally provisioning a MapReduce job, and demonstrates that existing approaches can result in far-from-optimal provisioning. We then present a preliminary approach that improves MapReduce provisioning by analyzing the resource consumption of the application at hand and comparing it with a database of similar resource consumption signatures of other applications.
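
The signature comparison described at the end of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each signature is a fixed-length vector of (CPU, disk, network) utilization, that similarity is plain Euclidean distance, and that each stored signature carries a known-good configuration. The names `closest_signature`, `SIGNATURES`, `map_slots`, and `reduce_slots` and all the values are hypothetical.

```python
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length signatures."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_signature(query, database):
    """Return (name, config) of the stored entry nearest to the query."""
    if not database:
        raise ValueError("database is empty")
    best = min(database, key=lambda entry: distance(query, entry["signature"]))
    return best["name"], best["config"]


# Hypothetical database: signatures are (cpu, disk, network) utilization
# in [0, 1]; the configurations are illustrative, not measured values.
SIGNATURES = [
    {
        "name": "cpu_bound_job",
        "signature": (0.9, 0.2, 0.1),
        "config": {"map_slots": 2, "reduce_slots": 2},
    },
    {
        "name": "io_bound_job",
        "signature": (0.3, 0.9, 0.4),
        "config": {"map_slots": 6, "reduce_slots": 2},
    },
]

# Signature of the application at hand (assumed to come from a short profiling run).
name, config = closest_signature((0.8, 0.3, 0.2), SIGNATURES)
```

Here the query lies closest to the hypothetical `cpu_bound_job` entry, so that entry's configuration would be reused as a starting point for provisioning the new job.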