Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Quincy: fair scheduling for distributed computing clusters
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling
Proceedings of the 5th European conference on Computer systems
What is Twitter, a social network or a news media?
Proceedings of the 19th international conference on World wide web
ParaTimer: a progress indicator for MapReduce DAGs
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Visual, Log-Based Causal Tracing for Performance Debugging of MapReduce Systems
ICDCS '10 Proceedings of the 2010 IEEE 30th International Conference on Distributed Computing Systems
Towards optimizing hadoop provisioning in the cloud
HotCloud'09 Proceedings of the 2009 conference on Hot topics in cloud computing
On availability of intermediate data in cloud computations
HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Improving MapReduce performance in heterogeneous environments
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Dynamic proportional share scheduling in Hadoop
JSSPP'10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing
ARIA: automatic resource inference and allocation for mapreduce environments
Proceedings of the 8th ACM international conference on Autonomic computing
FLEX: a slot allocation scheduling optimizer for MapReduce workloads
Proceedings of the ACM/IFIP/USENIX 11th International Conference on Middleware
Optimizing analytic data flows for multiple execution engines
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
On modelling and prediction of total CPU usage for applications in mapreduce environments
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Optimizing and Tuning MapReduce Jobs to Improve the Large-Scale Data Analysis Process
International Journal of Intelligent Systems
CloudPack* exploiting workload flexibility through rational pricing
Proceedings of the 13th International Middleware Conference
Benchmarking approach for designing a mapreduce performance model
Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
Utility-aware deferred load balancing in the cloud driven by dynamic pricing of electricity
Proceedings of the Conference on Design, Automation and Test in Europe
A data-centric heuristic for Hadoop provisioning in the cloud
Proceedings of the 6th ACM India Computing Convention
Hybrid Analytic Flows-the Case for Optimization
Fundamenta Informaticae - Scalable Workflow Enactment Engines and Technology
Hi-index | 0.00 |
Many companies are increasingly using MapReduce for efficient large scale data processing such as personalized advertising, spam detection, and different data mining tasks. Cloud computing offers an attractive option for businesses to rent a suitable size Hadoop cluster, consume resources as a service, and pay only for resources that were utilized. One of the open questions in such environments is the amount of resources that a user should lease from the service provider. Often, a user targets specific performance goals and the application needs to complete data processing by a certain time deadline. However, currently, the task of estimating required resources to meet application performance goals is solely the users' responsibility. In this work, we introduce a novel framework and technique to address this problem and to offer a new resource sizing and provisioning service in MapReduce environments. For a MapReduce job that needs to be completed within a certain time, the job profile is built from the job past executions or by executing the application on a smaller data set using an automated profiling tool. Then, by applying scaling rules combined with a fast and efficient capacity planning model, we generate a set of resource provisioning options. Moreover, we design a model for estimating the impact of node failures on a job completion time to evaluate worst case scenarios. We validate the accuracy of our models using a set of realistic applications. The predicted completion times of generated resource provisioning options are within 10% of the measured times in our 66-node Hadoop cluster.