Optimizing and Tuning MapReduce Jobs to Improve the Large-Scale Data Analysis Process

  • Authors:
  • Wichian Premchaiswadi; Walisa Romsaiyud

  • Affiliations:
  • Graduate School of Information Technology, Siam University, Bangkok, 10160, Thailand (both authors)

  • Venue:
  • International Journal of Intelligent Systems
  • Year:
  • 2013

Abstract

Data-intensive applications process large volumes of data using a parallel processing method. MapReduce is a programming model designed for data-intensive applications over massive data sets and an execution framework for large-scale data processing on clusters of commodity servers. While fault tolerance, an easy programming structure, and high scalability are considered strong points of MapReduce, its configuration parameters must be fine-tuned to the specific deployment, which makes configuration and performance tuning complex. This paper explains the tuning of the Hadoop configuration parameters that directly affect the performance of a MapReduce job workflow under various conditions, with the goal of achieving maximum performance. On the basis of the empirical data we collected, it became apparent that three main methodologies affect the execution time of MapReduce running on cluster systems. Therefore, in this paper, we present a model that consists of three main modules: (1) extending a data redistribution technique in order to identify the high-performance nodes, (2) tuning the number of map/reduce slots in order to reduce execution time, and (3) developing a new hybrid routing schedule for the shuffle phase in order to define the scheduler task while reducing the memory management overhead. © 2013 Wiley Periodicals, Inc.
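
To make the tuning dimensions mentioned in the abstract concrete, the sketch below shows where such Hadoop configuration parameters are typically set in a standard Java job driver. The property names (mapreduce.task.io.sort.mb, mapreduce.reduce.shuffle.parallelcopies, mapreduce.map.memory.mb, and so on) are standard Hadoop settings; the numeric values and the driver class itself are illustrative assumptions for a generic deployment, not the tuned configurations or results reported in the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Generic job driver exposing shuffle- and slot-related tuning knobs.
// Values are illustrative only, not the settings evaluated in the paper.
public class TunedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map-side sort buffer and merge factor: larger buffers mean fewer
        // spills to disk before the shuffle phase begins.
        conf.setInt("mapreduce.task.io.sort.mb", 256);
        conf.setInt("mapreduce.task.io.sort.factor", 50);

        // Reduce-side shuffle: number of parallel fetch threads per reducer.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);

        // Per-task memory limits (MRv2/YARN). On Hadoop 1.x the analogous
        // lever is the per-TaskTracker slot count configured cluster-side in
        // mapred-site.xml (mapred.tasktracker.map.tasks.maximum and
        // mapred.tasktracker.reduce.tasks.maximum).
        conf.setInt("mapreduce.map.memory.mb", 1536);
        conf.setInt("mapreduce.reduce.memory.mb", 3072);

        Job job = Job.getInstance(conf, "tuned-analysis-job");
        job.setJarByClass(TunedJobDriver.class);

        // Application logic would be plugged in here via job.setMapperClass()
        // and job.setReducerClass(); with the defaults this driver simply
        // passes input records through unchanged.

        // The reducer count spreads shuffle load across the cluster -- one of
        // the tuning dimensions the abstract refers to.
        job.setNumReduceTasks(8);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In practice, such a driver would be submitted with `hadoop jar` against an input and output path, and the per-job settings above would be varied (or overridden with -D options) while measuring job execution time, which is the kind of empirical comparison the abstract describes.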