SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

Authors:
Rong Gu;Xiaoliang Yang;Jinshuang Yan;Yuanhao Sun;Bing Wang;Chunfeng Yuan;Yihua Huang
Venue:
Journal of Parallel and Distributed Computing
Year:
2014

Abstract

As a widely used parallel computing framework for big data processing, Hadoop MapReduce emphasizes high data throughput over low job-execution latency. However, a growing number of big data applications built on MapReduce require quick response times. As a result, improving the performance of MapReduce jobs, especially short jobs, is of great practical significance and has attracted increasing attention from both academia and industry. Many efforts have been made to improve Hadoop performance at the job scheduling or job parameter optimization level. In this paper, we explore an approach that improves the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First, by analyzing the job and task execution mechanism in the MapReduce framework, we reveal two critical limitations to job execution performance. We then propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time cost of the job's initialization and termination stages; second, instead of using the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism to accelerate performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims to shorten the execution time of MapReduce jobs, especially short jobs. Experimental results show that, compared to standard Hadoop, SHadoop achieves a stable performance improvement of around 25% on average across comprehensive benchmarks without losing scalability or speedup. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop (IDH).
To the best of our knowledge, this work is the first effort to explore optimizing the execution mechanism inside the map/reduce tasks of a job. Its advantage is that it can complement job scheduling optimizations to further improve job execution performance.
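
The contrast the abstract draws between heartbeat-based and instant messaging communication can be illustrated with a toy model. The sketch below is not SHadoop's implementation. It only computes how long a master would wait to learn of a finished task if workers report on fixed heartbeat ticks, versus reporting immediately. All names, the 3-second interval, the 50 ms message latency, and the task finish times are assumptions chosen for illustration.

```python
# Toy model (hypothetical, not SHadoop code): reporting delay of task
# completions under periodic heartbeats versus immediate messages.
# All times are integer milliseconds to keep the arithmetic exact.

def heartbeat_notify_time(finish_ms, interval_ms):
    """Master learns of the task at the first heartbeat tick at or after
    finish_ms, with ticks at interval_ms, 2*interval_ms, ..."""
    ticks = -(-finish_ms // interval_ms)  # ceiling division
    return ticks * interval_ms

def instant_notify_time(finish_ms, latency_ms):
    """Master learns of the task one message latency after it finishes."""
    return finish_ms + latency_ms

def total_reporting_delay(finish_times_ms, notify):
    """Sum of (notification time - finish time) over all tasks."""
    return sum(notify(t) - t for t in finish_times_ms)

HEARTBEAT_INTERVAL_MS = 3000  # assumed heartbeat interval
MESSAGE_LATENCY_MS = 50       # assumed one-way message latency
finish_times_ms = [500, 1200, 4000, 7900]  # made-up task finish times

heartbeat_delay = total_reporting_delay(
    finish_times_ms, lambda t: heartbeat_notify_time(t, HEARTBEAT_INTERVAL_MS))
instant_delay = total_reporting_delay(
    finish_times_ms, lambda t: instant_notify_time(t, MESSAGE_LATENCY_MS))

print("heartbeat reporting delay (ms):", heartbeat_delay)
print("instant reporting delay (ms):", instant_delay)
```

In this model the heartbeat delay per task is bounded by the interval, not by the work done, which is why the abstract singles out short jobs: the fixed waiting time makes up a larger share of their total runtime.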