In the era of Big Data, huge amounts of structured and unstructured data are produced daily by a myriad of ubiquitous sources. Such data is difficult to work with and requires massively parallel software running on large numbers of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. For MapReduce to work, it must divide the workload among the computers in a network, so its performance depends strongly on how evenly it distributes this workload. This can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data, and one way to mitigate the problems caused by data skew is data sampling: how evenly the partitioner distributes the data depends on how large and representative the sample is, and on how well the partitioning mechanism analyzes it. This paper proposes a partitioning algorithm that improves both load balancing and memory consumption, achieved through an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against the state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory-efficient, and more accurate than the existing implementation.
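The core idea discussed above, building partition boundaries from a sample of the keys so that each reducer receives a roughly equal share, can be sketched as follows. This is a minimal illustration of sample-based range partitioning in the style of TeraSort's partitioner, not the paper's actual algorithm; all function names and parameters here are illustrative assumptions.

```python
# Illustrative sketch of sample-based range partitioning (the general
# technique behind TeraSort-style partitioners), not the paper's method.
import bisect
import random

random.seed(0)  # deterministic sample for reproducibility of this sketch

def build_splitters(keys, num_partitions, sample_size=100):
    """Sample the input keys and derive num_partitions - 1 splitter keys
    at evenly spaced quantiles of the sorted sample."""
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition(key, splitters):
    """Route a key to a partition index by binary search over the splitters."""
    return bisect.bisect_right(splitters, key)

# Toy workload: 1000 string keys routed into 4 partitions.
keys = [f"key{i:04d}" for i in range(1000)]
splitters = build_splitters(keys, num_partitions=4)
counts = [0] * 4
for k in keys:
    counts[partition(k, splitters)] += 1
# With a large, representative sample the four partitions receive roughly
# equal counts; a biased sample (or skewed keys) would unbalance them,
# which is exactly the problem the abstract's improved sampler targets.
```

The quality of the load balance hinges entirely on the sample: a small or unrepresentative sample yields skewed splitters, which is why the abstract ties partitioning accuracy to both sample size and the analysis performed by the partitioner.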