Automated profiling and resource management of Pig programs for meeting service level objectives

  • Authors:
  • Zhuoyao Zhang;Ludmila Cherkasova;Abhishek Verma;Boon Thau Loo

  • Affiliations:
  • University of Pennsylvania, Philadelphia, PA, USA;Hewlett-Packard Labs, Palo Alto, CA, USA;University of Illinois at Urbana-Champaign, Urbana, IL, USA;University of Pennsylvania, Philadelphia, PA, USA

  • Venue:
  • Proceedings of the 9th international conference on Autonomic computing
  • Year:
  • 2012

Abstract

An increasing number of MapReduce applications associated with live business intelligence require completion time guarantees. In this paper, we consider the popular Pig framework, which provides a high-level SQL-like abstraction on top of the MapReduce engine for processing large data sets. Programs written in such frameworks are compiled into directed acyclic graphs (DAGs) of MapReduce jobs. There is a lack of performance models and analysis tools for automated performance management of such MapReduce jobs. We offer a performance modeling environment for Pig programs that automatically profiles jobs from past runs and aims to solve the following inter-related problems: (i) estimating the completion time of a Pig program as a function of allocated resources; (ii) estimating the amount of resources (the number of map and reduce slots) required for completing a Pig program within a given (soft) deadline. To solve these problems, we first optimize Pig program execution by enforcing the optimal schedule of its concurrent jobs. For DAGs with concurrent jobs, this optimization reduces the program completion time by 10%-27% in our experiments. Moreover, it eliminates possible non-determinism in the execution of concurrent jobs in the Pig program, and therefore enables a more accurate performance model for Pig programs. We validate our approach using a 66-node Hadoop cluster and a diverse set of workloads: the PigMix benchmark, TPC-H queries, and customized queries mining a collection of HP Labs' web proxy logs. The proposed scheduling optimization leads to significant resource savings (20%-40% in our experiments) compared with the original, unoptimized solution, and the predicted program completion times are within 10% of the measured ones.
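
To make the relationship between problems (i) and (ii) concrete, here is a minimal sketch in Python. It is not the model from the paper: it assumes a deliberately simple "waves of tasks" estimate, assumes the jobs of a program run one after another, and assumes the same number of map and reduce slots. The names `JobProfile`, `stage_time`, `program_time`, and `min_slots_for_deadline` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical per-job profile, e.g. averaged from a past run."""
    num_maps: int
    avg_map_sec: float
    num_reduces: int
    avg_reduce_sec: float


def stage_time(num_tasks: int, avg_task_sec: float, slots: int) -> float:
    # Assumption: tasks execute in waves of at most `slots` tasks,
    # and every wave takes roughly the average task duration.
    return math.ceil(num_tasks / slots) * avg_task_sec


def program_time(jobs, map_slots: int, reduce_slots: int) -> float:
    # Problem (i): completion time as a function of allocated slots.
    # Assumption: jobs run sequentially (no concurrent DAG branches).
    return sum(
        stage_time(j.num_maps, j.avg_map_sec, map_slots)
        + stage_time(j.num_reduces, j.avg_reduce_sec, reduce_slots)
        for j in jobs
    )


def min_slots_for_deadline(jobs, deadline_sec: float, max_slots: int):
    # Problem (ii): smallest symmetric allocation that meets the deadline.
    # The estimate never increases as slots grow, so the first hit is minimal.
    for slots in range(1, max_slots + 1):
        if program_time(jobs, slots, slots) <= deadline_sec:
            return slots
    return None  # deadline not reachable within max_slots


jobs = [
    JobProfile(num_maps=100, avg_map_sec=10.0, num_reduces=20, avg_reduce_sec=30.0),
    JobProfile(num_maps=40, avg_map_sec=5.0, num_reduces=10, avg_reduce_sec=20.0),
]
estimate = program_time(jobs, map_slots=10, reduce_slots=10)
needed = min_slots_for_deadline(jobs, deadline_sec=200.0, max_slots=100)
```

The point of the sketch is only that problem (ii) can be answered by inverting the estimator for problem (i); a realistic model would also have to account for concurrent jobs in the DAG and for variation in task durations.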