CASH: context aware scheduler for Hadoop

Authors:
K. Arun Kumar;Vamshi Krishna Konishetty;Kaladhar Voruganti;G. V. Prabhakara Rao
Affiliations:
Sri Sathya Sai Institute of Higher Learning, Prasanthi Nilayam, India;Sri Sathya Sai Institute of Higher Learning, Prasanthi Nilayam, India;NetApp Inc., San Jose;Sri Sathya Sai Institute of Higher Learning, Prasanthi Nilayam, India
Venue:
Proceedings of the International Conference on Advances in Computing, Communications and Informatics
Year:
2012

Abstract

The Hadoop MapReduce infrastructure is designed to solve problems that can be broken down into tasks that run in parallel. The key reasons for MapReduce's popularity are that it runs on commodity hardware and that it comes with a job scheduler and a task management framework. Thus, the MapReduce framework allows application programmers to focus on their application program and not on the management infrastructure. The job scheduler is a key component of the MapReduce framework, as it controls when and where a job's tasks are executed. However, current MapReduce schedulers assume that the Hadoop cluster is homogeneous. In this paper we show that making the scheduler aware of cluster heterogeneity, and able to leverage it, can improve the overall throughput of the system. The design of our scheduler is based on two key insights: 1) A large percentage of the MapReduce jobs that are run are periodic in nature. That is, these jobs execute at the same time and have roughly the same characteristics with respect to their CPU, network and disk resource requirements. 2) The nodes in a Hadoop cluster become heterogeneous over time as failed and old nodes are replaced by newer ones. Thus, there is a need for a 'Context Aware Scheduler for Hadoop (CASH)' that knows the context, i.e., the job characteristics (CPU bound or I/O bound) and the resource characteristics, such as the computational or I/O strength of the nodes in the cluster. We have implemented the CASH algorithm both in a simulator and on a real Hadoop MapReduce cluster. We quantitatively compare CASH with the existing Hadoop FIFO scheduler, and our results show a significant improvement in the overall execution time of a set of MapReduce jobs. Additionally, we optimized the CASH algorithm for jobs that share the same working-set data and showed the benefits.
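
The abstract does not spell out the CASH algorithm, so the following is only an illustrative sketch of the general idea it describes: matching a job's dominant resource need (CPU or I/O) to the node that is strongest in that resource. All names here (Node, Job, cpu_score, io_score, pick_node, schedule) are hypothetical and are not taken from the paper or from the Hadoop API. The sketch also assumes that each job's class is already known, for example from earlier runs of a periodic job.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_score: float  # hypothetical relative compute strength
    io_score: float   # hypothetical relative disk/network strength
    free_slots: int = 1


@dataclass
class Job:
    name: str
    kind: str  # "cpu" or "io"; assumed known from earlier runs of the job


def pick_node(job, nodes):
    """Return the free node strongest on the job's dominant resource, or None."""
    free = [n for n in nodes if n.free_slots > 0]
    if not free:
        return None
    if job.kind == "cpu":
        return max(free, key=lambda n: n.cpu_score)
    return max(free, key=lambda n: n.io_score)


def schedule(jobs, nodes):
    """Place jobs in arrival order, one slot each, on the best-matching node."""
    placement = {}
    for job in jobs:
        node = pick_node(job, nodes)
        if node is None:
            break  # no capacity left; remaining jobs would have to wait
        node.free_slots -= 1
        placement[job.name] = node.name
    return placement


if __name__ == "__main__":
    nodes = [
        Node("old", cpu_score=1.0, io_score=1.0),
        Node("fast-cpu", cpu_score=3.0, io_score=1.0),
        Node("fast-disk", cpu_score=1.0, io_score=3.0),
    ]
    jobs = [Job("sort", "io"), Job("pi", "cpu")]
    print(schedule(jobs, nodes))
```

A plain FIFO policy would hand each job to whichever node asks for work first, regardless of fit. The sketch differs only in the node-selection step, which is where the heterogeneity described in the abstract would come into play.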