CASH: context aware scheduler for Hadoop

  • Authors:
  • K. Arun Kumar;Vamshi Krishna Konishetty;Kaladhar Voruganti;G. V. Prabhakara Rao

  • Affiliations:
  • Sri Sathya Sai Institute of Higher Learning, Prasanthi Nilayam, India;Sri Sathya Sai Institute of Higher Learning, Prasanthi Nilayam, India;NetApp Inc., San Jose;Sri Sathya Sai Institute of Higher Learning, Prasanthi Nilayam, India

  • Venue:
  • Proceedings of the International Conference on Advances in Computing, Communications and Informatics
  • Year:
  • 2012

Abstract

The Hadoop MapReduce infrastructure is designed to solve problems that can be broken down into tasks that run in parallel. The key reason for MapReduce's popularity is that it runs on commodity hardware and comes with a job scheduler and task management framework. Thus, the MapReduce framework lets application programmers focus on their application program rather than on the management infrastructure. The job scheduler is a key component of the MapReduce framework, as it controls when and where a job's tasks are executed. However, current MapReduce schedulers assume that the Hadoop cluster is homogeneous. In this paper we show that making the scheduler aware of cluster heterogeneity, and able to leverage it, can improve the overall throughput of the system. The design of our scheduler is based on two key insights: 1) A large percentage of MapReduce jobs are periodic. That is, these jobs execute at the same time and have roughly the same characteristics with respect to their CPU, network, and disk resource requirements. 2) The nodes in a Hadoop cluster become heterogeneous over time as failed and old nodes are replaced by newer ones. Thus, there is a need for a 'Context Aware Scheduler for Hadoop (CASH)' that knows the context, i.e., the job characteristics (CPU or I/O bound) and resource characteristics such as the computational or I/O strength of the nodes in the cluster. We have implemented the CASH algorithm both in a simulator and in a real Hadoop MapReduce cluster. We quantitatively compare CASH with the existing Hadoop FIFO scheduler, and our results show significant improvement in the overall execution time of a set of MapReduce jobs. Additionally, we optimized the CASH algorithm for jobs that share the same working set data and showed the benefits.
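
The general idea of context-aware placement can be illustrated with a minimal Python sketch. This is not the CASH algorithm from the paper, whose details the abstract does not give; it is a simple greedy matcher under stated assumptions. The names Node, Job, dominant_strength, and assign, as well as the numeric strength scores, are hypothetical and chosen only for illustration. The sketch assumes each job has already been labeled CPU bound or I/O bound (for example, from earlier runs of a periodic job) and that each node has a relative CPU and I/O strength score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    cpu_strength: float  # hypothetical relative score, higher is stronger
    io_strength: float   # hypothetical relative score, higher is stronger


@dataclass(frozen=True)
class Job:
    name: str
    kind: str  # assumed label: "cpu" for CPU bound, "io" for I/O bound


def dominant_strength(node: Node) -> str:
    """Classify a node by whichever of its two scores is larger."""
    return "cpu" if node.cpu_strength >= node.io_strength else "io"


def assign(jobs: list[Job], nodes: list[Node]) -> dict[str, str]:
    """Greedy placement: walk the jobs in queue order and give each one a
    free node whose dominant strength matches the job's label. If no
    matching node is free, fall back to the first free node. Jobs left
    over when all nodes are taken are not placed."""
    free = list(nodes)
    placement: dict[str, str] = {}
    for job in jobs:
        if not free:
            break
        match = next(
            (n for n in free if dominant_strength(n) == job.kind),
            free[0],
        )
        free.remove(match)
        placement[job.name] = match.name
    return placement


# Example cluster: an older node with better disks than CPU,
# and a newer node with a stronger CPU (illustrative values only).
nodes = [Node("old-1", 1.0, 2.0), Node("new-1", 3.0, 1.5)]
jobs = [Job("pi-estimate", "cpu"), Job("sort", "io")]
placement = assign(jobs, nodes)
```

A scheduler that ignores context would hand the first job in the queue to the first free node, placing the CPU bound job on the older node here. The sketch instead routes it to the node whose CPU score dominates, which is the kind of job-to-node matching the abstract describes.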