As cloud computing continues to mature, IT managers have begun to focus on additional performance requirements: quality of service and tailored resource allocation for meeting service performance goals. In this paper, we consider the popular Pig framework, which provides a high-level SQL-like abstraction on top of the MapReduce engine for processing large data sets. Programs written in such frameworks are compiled into directed acyclic graphs (DAGs) of MapReduce jobs. Data processing applications must often produce results by a given deadline. We design a performance modeling framework for Pig programs that solves two inter-related problems: (i) estimating the completion time of a Pig program as a function of allocated resources, and (ii) estimating the amount of resources (the number of map and reduce slots) required to complete a Pig program within a given (soft) deadline. To achieve these goals, we first optimize a Pig program's execution by enforcing the optimal schedule of its concurrent jobs. This optimization reduces program completion time (by 10%-27% in our experiments) and, moreover, eliminates possible non-determinism in the DAG's execution. Building on this optimization, we propose an accurate performance model for Pig programs. Our approach yields significant resource savings (20%-60% in our experiments) compared with the original, unoptimized solution. We validate our approach on a 66-node Hadoop cluster using two workload sets: TPC-H queries and a set of customized queries mining a collection of HP Labs' web proxy logs.
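The two estimation problems can be illustrated with the classic lower/upper makespan bounds for processing n tasks greedily on k identical slots. This is only a minimal sketch under that assumption: the function names, the sequentialized summation over the DAG's jobs, and the linear search for the deadline-driven slot count are illustrative choices, not the paper's actual model.

```python
# Hedged sketch of deadline-driven resource estimation for a MapReduce DAG.
# Assumes the standard greedy-scheduling makespan bounds for n tasks on k slots:
#   lower = n * avg / k,   upper = (n - 1) * avg / k + max.
# All names here are hypothetical, not taken from the paper.

def stage_bounds(task_durations, slots):
    """Lower/upper bounds on completion time of one map or reduce stage."""
    n = len(task_durations)
    avg = sum(task_durations) / n
    longest = max(task_durations)
    lower = n * avg / slots
    upper = (n - 1) * avg / slots + longest
    return lower, upper

def pig_program_bounds(jobs, map_slots, reduce_slots):
    """Problem (i): completion-time bounds for a program, approximated here by
    summing per-job bounds over its jobs run one after another.
    `jobs` is a list of (map_task_durations, reduce_task_durations) pairs."""
    lo = hi = 0.0
    for map_tasks, reduce_tasks in jobs:
        m_lo, m_hi = stage_bounds(map_tasks, map_slots)
        r_lo, r_hi = stage_bounds(reduce_tasks, reduce_slots)
        lo += m_lo + r_lo
        hi += m_hi + r_hi
    return lo, hi

def min_slots_for_deadline(task_durations, deadline, max_slots=10000):
    """Problem (ii) for a single stage: smallest slot count whose upper bound
    meets the (soft) deadline, found by linear search. Returns None if even
    one task exceeds the deadline."""
    if max(task_durations) > deadline:
        return None
    for k in range(1, max_slots + 1):
        _, upper = stage_bounds(task_durations, k)
        if upper <= deadline:
            return k
    return None
```

For example, a stage with four 10-second tasks on two slots is bounded by [20, 25] seconds, so meeting a 25-second deadline requires two slots under these (assumed) bounds.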