Resource provisioning framework for mapreduce jobs with performance goals

Authors:
Abhishek Verma;Ludmila Cherkasova;Roy H. Campbell
Affiliations:
University of Illinois at Urbana-Champaign;HP Labs, Palo Alto;University of Illinois at Urbana-Champaign
Venue:
Middleware'11 Proceedings of the 12th ACM/IFIP/USENIX international conference on Middleware
Year:
2011

Citing 15
Cited 8

Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
MapReduce: simplified data processing on large clusters

Communications of the ACM - 50th anniversary issue: 1958 - 2008
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Quincy: fair scheduling for distributed computing clusters

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling

Proceedings of the 5th European conference on Computer systems
What is Twitter, a social network or a news media?

Proceedings of the 19th international conference on World wide web
ParaTimer: a progress indicator for MapReduce DAGs

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Visual, Log-Based Causal Tracing for Performance Debugging of MapReduce Systems

ICDCS '10 Proceedings of the 2010 IEEE 30th International Conference on Distributed Computing Systems
Towards optimizing hadoop provisioning in the cloud

HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
On availability of intermediate data in cloud computations

HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Improving MapReduce performance in heterogeneous environments

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Dynamic proportional share scheduling in Hadoop

JSSPP'10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing
ARIA: automatic resource inference and allocation for mapreduce environments

Proceedings of the 8th ACM international conference on Autonomic computing
FLEX: a slot allocation scheduling optimizer for MapReduce workloads

Proceedings of the ACM/IFIP/USENIX 11th International Conference on Middleware

Optimizing analytic data flows for multiple execution engines

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
On modelling and prediction of total CPU usage for applications in mapreduce environments

ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Optimizing and Tuning MapReduce Jobs to Improve the Large-Scale Data Analysis Process

International Journal of Intelligent Systems
CloudPack* exploiting workload flexibility through rational pricing

Proceedings of the 13th International Middleware Conference
Benchmarking approach for designing a mapreduce performance model

Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
Utility-aware deferred load balancing in the cloud driven by dynamic pricing of electricity

Proceedings of the Conference on Design, Automation and Test in Europe
A data-centric heuristic for Hadoop provisioning in the cloud

Proceedings of the 6th ACM India Computing Convention
Hybrid Analytic Flows-the Case for Optimization

Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology

Quantified Score

Hi-index	0.00

Visualization

Abstract

Many companies are increasingly using MapReduce for efficient large scale data processing such as personalized advertising, spam detection, and different data mining tasks. Cloud computing offers an attractive option for businesses to rent a suitable size Hadoop cluster, consume resources as a service, and pay only for resources that were utilized. One of the open questions in such environments is the amount of resources that a user should lease from the service provider. Often, a user targets specific performance goals and the application needs to complete data processing by a certain time deadline. However, currently, the task of estimating required resources to meet application performance goals is solely the users' responsibility. In this work, we introduce a novel framework and technique to address this problem and to offer a new resource sizing and provisioning service in MapReduce environments. For a MapReduce job that needs to be completed within a certain time, the job profile is built from the job past executions or by executing the application on a smaller data set using an automated profiling tool. Then, by applying scaling rules combined with a fast and efficient capacity planning model, we generate a set of resource provisioning options. Moreover, we design a model for estimating the impact of node failures on a job completion time to evaluate worst case scenarios. We validate the accuracy of our models using a set of realistic applications. The predicted completion times of generated resource provisioning options are within 10% of the measured times in our 66-node Hadoop cluster.