Balancing reducer skew in MapReduce workloads using progressive sampling

Authors:
Smriti R. Ramakrishnan;Garret Swart;Aleksey Urmanov
Affiliations:
Oracle Corporation;Oracle Corporation;Oracle Corporation
Venue:
Proceedings of the Third ACM Symposium on Cloud Computing
Year:
2012

Abstract

The elapsed time of a parallel job depends on the completion time of its longest-running constituent. We present a static load balancing algorithm that distributes work evenly across the reducers in a MapReduce job, resulting in significant elapsed time reductions. Taking a user-specified model of reducer performance, our load balancer uses a progressive, objective-based cluster sampler to estimate the load associated with each reduce-key. It balances the workload using Key Chopping, which splits keys with large loads into sub-keys that can be assigned to different distributive reducers, and Key Packing, which assigns keys with medium loads to reducers so as to minimize the maximum reducer load. Keys with small loads are hashed, as they have little effect on the balance. This process repeats until the user-specified balancing objective and confidence level are achieved. The sampler and load balancer have been implemented in the Oracle Loader for Hadoop (OLH), a commercial MapReduce application that employs Apache Hadoop to perform parallel data formatting and data movement into partitioned relational tables. We present the performance improvements we achieve in both OLH and in a MapReduce program for inverted index creation. The balancer works for arbitrary IID key distributions, the time used for sampling is small, and our solution is very effective at reducing the elapsed time of the MapReduce jobs we explored.
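
The three-way treatment of keys described in the abstract (chop large keys, pack medium keys, hash small keys) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function name `balance_keys`, the thresholds `large_frac` and `small_frac`, the use of CRC32 for hashing, and the greedy longest-first packing heuristic are all assumptions chosen for illustration. The sketch also takes the per-key load estimates as given, whereas the paper obtains them from a progressive sampler.

```python
import heapq
import math
import zlib


def balance_keys(key_loads, num_reducers, large_frac=1.0, small_frac=0.01):
    """Assign estimated per-key loads to reducers (illustrative sketch only).

    Returns (assignment, reducer_loads). assignment maps each key to a list
    of (reducer_index, load_share) pairs; a chopped key has several pairs.
    """
    total = sum(key_loads.values())
    ideal = total / num_reducers        # load per reducer if perfectly balanced
    large_cut = large_frac * ideal      # hypothetical cutoff for "large" keys
    small_cut = small_frac * ideal      # hypothetical cutoff for "small" keys

    reducer_loads = [0.0] * num_reducers
    assignment = {}
    pieces = []                         # (load, key) units left to pack

    for key, load in key_loads.items():
        if load > large_cut:
            # Key Chopping: split the key into equal sub-keys, each no
            # larger than large_cut (capped at one sub-key per reducer).
            n = min(num_reducers, math.ceil(load / large_cut))
            pieces.extend((load / n, key) for _ in range(n))
        elif load >= small_cut:
            pieces.append((load, key))  # medium key, packed below
        else:
            # Small key: hash it, since it barely affects the balance.
            r = zlib.crc32(key.encode("utf-8")) % num_reducers
            reducer_loads[r] += load
            assignment[key] = [(r, load)]

    # Key Packing, approximated here by a greedy heuristic: place the
    # largest remaining piece on the currently least-loaded reducer.
    heap = [(reducer_loads[r], r) for r in range(num_reducers)]
    heapq.heapify(heap)
    for load, key in sorted(pieces, reverse=True):
        current, r = heapq.heappop(heap)
        reducer_loads[r] = current + load
        assignment.setdefault(key, []).append((r, load))
        heapq.heappush(heap, (reducer_loads[r], r))

    return assignment, reducer_loads


if __name__ == "__main__":
    loads = {"hot": 60.0, "a": 10.0, "b": 10.0, "c": 10.0, "d": 10.0, "t": 0.1}
    assignment, reducer_loads = balance_keys(loads, num_reducers=4)
    print(assignment["hot"])
    print(reducer_loads)
```

In this toy input the key `hot` alone carries more than twice the ideal per-reducer load, so without chopping, whichever reducer receives it would dominate the job's elapsed time. The greedy packing step is only one simple way to aim at a low maximum reducer load; the abstract does not specify which packing method the authors use.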