In the era of Big Data, huge amounts of structured and unstructured data are produced daily by a myriad of ubiquitous sources. Such data is difficult to work with and requires massively parallel software running on large numbers of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. For MapReduce to work, it must divide the workload among the computers in a network, so its performance depends strongly on how evenly it distributes this workload. This can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data, and one way to mitigate the problems caused by data skew is data sampling: how evenly the partitioner distributes the data depends on how large and representative the sample is, and on how well the partitioning mechanism analyzes it. This paper proposes a partitioning algorithm that improves both load balancing and memory consumption, achieved through an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against the state-of-the-art partitioning mechanism employed by TeraSort. Experiments show that the proposed algorithm is faster, more memory-efficient, and more accurate than the existing implementation.
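The core idea discussed above, building partition boundaries from a sample of the keys so that each reducer receives a roughly equal share, can be sketched as follows. This is a minimal illustration of sample-based range partitioning in the style of TeraSort's partitioner, not the paper's actual algorithm; all function names and parameters here are illustrative assumptions.

```python
# Illustrative sketch of sample-based range partitioning (the general
# technique behind TeraSort-style partitioners), not the paper's method.
import bisect
import random

random.seed(0)  # deterministic sample for reproducibility of this sketch

def build_splitters(keys, num_partitions, sample_size=100):
    """Sample the input keys and derive num_partitions - 1 splitter keys
    at evenly spaced quantiles of the sorted sample."""
    sample = sorted(random.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

def partition(key, splitters):
    """Route a key to a partition index by binary search over the splitters."""
    return bisect.bisect_right(splitters, key)

# Toy workload: 1000 string keys routed into 4 partitions.
keys = [f"key{i:04d}" for i in range(1000)]
splitters = build_splitters(keys, num_partitions=4)
counts = [0] * 4
for k in keys:
    counts[partition(k, splitters)] += 1
# With a large, representative sample the four partitions receive roughly
# equal counts; a biased sample (or skewed keys) would unbalance them,
# which is exactly the problem the abstract's improved sampler targets.
```

The quality of the load balance hinges entirely on the sample: a small or unrepresentative sample yields skewed splitters, which is why the abstract ties partitioning accuracy to both sample size and the analysis performed by the partitioner.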