An increasing number of MapReduce applications associated with live business intelligence require completion-time guarantees. In this paper, we consider the popular Pig framework, which provides a high-level SQL-like abstraction on top of the MapReduce engine for processing large data sets. Programs written in such frameworks are compiled into directed acyclic graphs (DAGs) of MapReduce jobs. There is a lack of performance models and analysis tools for automated performance management of such MapReduce jobs. We offer a performance modeling environment for Pig programs that automatically profiles jobs from past runs and aims to solve two inter-related problems: (i) estimating the completion time of a Pig program as a function of allocated resources; (ii) estimating the amount of resources (the number of map and reduce slots) required to complete a Pig program within a given (soft) deadline. To solve these problems, we first optimize a Pig program's execution by enforcing the optimal schedule of its concurrent jobs. For DAGs with concurrent jobs, this optimization reduces the program completion time by 10%-27% in our experiments. Moreover, it eliminates possible non-determinism in the execution of the Pig program's concurrent jobs, and therefore enables a more accurate performance model for Pig programs. We validate our approach using a 66-node Hadoop cluster and a diverse set of workloads: the PigMix benchmark, TPC-H queries, and customized queries mining a collection of HP Labs' web proxy logs. The proposed scheduling optimization leads to significant resource savings (20%-40% in our experiments) compared with the original, unoptimized solution, and the predicted program completion times are within 10% of the measured ones.
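The abstract does not spell out how the optimal schedule of concurrent jobs is computed. As an illustration only (the paper's actual method may differ), a classical way to order two-stage jobs, where each job runs a map stage followed by a reduce stage on separate slot pools, is Johnson's rule: jobs whose map stage is no longer than their reduce stage go first in increasing map-time order, the rest follow in decreasing reduce-time order. The sketch below uses hypothetical job names and per-stage durations to show how such an ordering can shrink the makespan of a batch of concurrent MapReduce jobs.

```python
# Hypothetical sketch: Johnson's rule for ordering two-stage (map, reduce)
# jobs to minimize makespan. Job names and durations are made up for
# illustration; this is not claimed to be the paper's exact algorithm.

def johnson_schedule(jobs):
    """jobs: list of (name, map_time, reduce_time) tuples.

    Returns the jobs reordered per Johnson's rule: map-dominated jobs
    last, reduce-dominated jobs first.
    """
    first = sorted((j for j in jobs if j[1] <= j[2]), key=lambda j: j[1])
    last = sorted((j for j in jobs if j[1] > j[2]), key=lambda j: -j[2])
    return first + last

def makespan(order):
    """Completion time when the map pool processes jobs sequentially and
    each job's reduce stage starts once both its own map stage is done
    and the reduce pool is free."""
    map_done = reduce_done = 0
    for _name, m, r in order:
        map_done += m
        reduce_done = max(reduce_done, map_done) + r
    return reduce_done

# Example batch of concurrent jobs (fabricated durations).
jobs = [("J1", 4, 5), ("J2", 4, 1), ("J3", 30, 4), ("J4", 6, 30), ("J5", 2, 3)]
order = johnson_schedule(jobs)
```

With these made-up durations, the Johnson ordering (J5, J1, J4, J3, J2) completes in 47 time units versus 77 for the original submission order, which is the kind of completion-time reduction for concurrent DAG jobs that the abstract reports at the 10%-27% level on real workloads.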