A Practical Performance Model for Hadoop MapReduce

  • Authors:
  • Xuelian Lin;Zide Meng;Chuan Xu;Meng Wang

  • Venue:
  • CLUSTERW '12 Proceedings of the 2012 IEEE International Conference on Cluster Computing Workshops
  • Year:
  • 2012

Abstract

An accurate performance model for MapReduce is increasingly important for analyzing and optimizing MapReduce jobs. It is also a precondition for implementing cost-based scheduling strategies and for translating Hive-like query jobs into sets of low-cost MapReduce jobs. However, the multiple processing steps in a MapReduce task, the complex relationships among these steps, and the difficulty of measuring the computational complexity of a MapReduce task greatly challenge the development and application of a precise performance model. In this paper, we define the concept of the relative computational complexity of a MapReduce task to estimate the task's complexity, and illustrate how to measure it. We then analyze the detailed composition of MapReduce tasks and the relationships among them, decompose the major cost items, and present a vector-style cost model with equations to calculate each cost item. Moreover, we provide equations to estimate task execution time from the cost vectors. Experiments on several Hadoop clusters confirm the effectiveness of the proposed performance model.
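The abstract does not give the cost items or the equations themselves, so the following is only a minimal sketch of what a vector-style cost model could look like, not the authors' formulation. Every name here (`CostVector`, `ClusterRates`, `estimate_task_seconds`) and the choice of four cost items are assumptions made for illustration: a task's cost is recorded as a vector of resource demands, and the execution time is estimated by dividing each demand by a measured per-cluster rate and summing the results, which assumes the steps do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostVector:
    """Hypothetical cost items for one task (not taken from the paper)."""
    disk_read_bytes: float
    disk_write_bytes: float
    network_bytes: float
    cpu_units: float  # stand-in for a measured computational-complexity term


@dataclass(frozen=True)
class ClusterRates:
    """Hypothetical per-node throughput figures measured on a given cluster."""
    disk_read_bps: float
    disk_write_bps: float
    network_bps: float
    cpu_units_per_sec: float


def estimate_task_seconds(cost: CostVector, rates: ClusterRates) -> float:
    """Estimate execution time as the sum of demand / rate over all cost items.

    Assumes the steps run strictly one after another; a model that accounts
    for overlapping I/O and computation would need a different combination.
    """
    return (
        cost.disk_read_bytes / rates.disk_read_bps
        + cost.disk_write_bytes / rates.disk_write_bps
        + cost.network_bytes / rates.network_bps
        + cost.cpu_units / rates.cpu_units_per_sec
    )


# Illustrative numbers only; they are not measurements from the paper.
example_cost = CostVector(
    disk_read_bytes=100e6,
    disk_write_bytes=50e6,
    network_bytes=25e6,
    cpu_units=10.0,
)
example_rates = ClusterRates(
    disk_read_bps=100e6,
    disk_write_bps=50e6,
    network_bps=25e6,
    cpu_units_per_sec=5.0,
)
example_seconds = estimate_task_seconds(example_cost, example_rates)
```

Keeping the cost vector separate from the cluster rates means the same task description can be re-evaluated against different clusters, which matches the abstract's mention of experiments on several Hadoop clusters.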