Optimizing and Tuning MapReduce Jobs to Improve the Large-Scale Data Analysis Process

  • Authors:
  • Wichian Premchaiswadi; Walisa Romsaiyud

  • Affiliations:
  • Graduate School of Information Technology, Siam University, Bangkok, 10160, Thailand (both authors)

  • Venue:
  • International Journal of Intelligent Systems
  • Year:
  • 2013

Abstract

Data-intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data-intensive applications over massive data sets and an execution framework for large-scale data processing on clusters of commodity servers. While fault tolerance, an easy programming structure, and high scalability are considered strong points of MapReduce, its configuration parameters must be fine-tuned to the specific deployment, which makes configuration and performance tuning complex. This paper explains the tuning of the Hadoop configuration parameters that directly affect the performance of a MapReduce job workflow under various conditions, with the goal of achieving maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) extending a data redistribution technique in order to identify the high-performance nodes, (2) tuning the number of map/reduce slots in order to reduce execution time, and (3) developing a new hybrid routing schedule for the shuffle phase in order to define the scheduler task while reducing the memory management overhead. © 2013 Wiley Periodicals, Inc.
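
To make the tuning dimensions mentioned in the abstract concrete, the sketch below shows where such Hadoop configuration parameters are typically set in a standard Java job driver. The property names (mapreduce.task.io.sort.mb, mapreduce.reduce.shuffle.parallelcopies, mapreduce.map.memory.mb, and so on) are standard Hadoop settings; the numeric values and the driver class itself are illustrative assumptions for a generic deployment, not the tuned configurations or results reported in the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Generic job driver exposing shuffle- and slot-related tuning knobs.
// Values are illustrative only, not the settings evaluated in the paper.
public class TunedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map-side sort buffer and merge factor: larger buffers mean fewer
        // spills to disk before the shuffle phase begins.
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        conf.setInt("mapreduce.task.io.sort.factor", 50);

        // Reduce-side shuffle: number of parallel fetch threads per reducer.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);

        // Per-task memory limits (MRv2/YARN). On Hadoop 1.x the analogous
        // lever is the per-TaskTracker slot count configured cluster-side in
        // mapred-site.xml (mapred.tasktracker.map.tasks.maximum and
        // mapred.tasktracker.reduce.tasks.maximum).
        conf.setInt("mapreduce.map.memory.mb", 1536);
        conf.setInt("mapreduce.reduce.memory.mb", 3072);

        Job job = Job.getInstance(conf, "tuned-analysis-job");
        job.setJarByClass(TunedJobDriver.class);

        // Application logic would be plugged in here via job.setMapperClass()
        // and job.setReducerClass(); with the defaults this driver simply
        // passes input records through unchanged.

        // The reducer count spreads shuffle load across the cluster -- one of
        // the tuning dimensions the abstract refers to.
        job.setNumReduceTasks(8);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In practice, such a driver would be submitted with `hadoop jar` against an input and output path, and the per-job settings above would be varied (or overridden with -D options) while measuring job execution time, which is the kind of empirical comparison the abstract describes.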