Using dual approximation algorithms for scheduling problems theoretical and practical results
Journal of the ACM (JACM)
A comparison of sorting algorithms for the connection machine CM-2
SPAA '91 Proceedings of the third annual ACM symposium on Parallel algorithms and architectures
Bifocal sampling for skew-resistant join size estimation
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Selectivity and cost estimation for joins based on random sampling
Journal of Computer and System Sciences
Readings in database systems (3rd ed.)
Readings in database systems (3rd ed.)
Chord: A scalable peer-to-peer lookup service for internet applications
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
Parallel sorting on a shared-nothing architecture using probabilistic splitting
PDIS '91 Proceedings of the first international conference on Parallel and distributed information systems
Sampling Issues in Parallel Database Systems
EDBT '92 Proceedings of the 3rd International Conference on Extending Database Technology: Advances in Database Technology
A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins
VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases
Practical Skew Handling in Parallel Joins
VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
Sampling-Based Estimation of the Number of Distinct Values of an Attribute
VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
"Balls into Bins" - A Simple and Tight Analysis
RANDOM '98 Proceedings of the Second International Workshop on Randomization and Approximation Techniques in Computer Science
Effective use of block-level sampling in statistics estimation
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Spreading the Load Using Consistent Hashing: A Preliminary Report
ISPDC '04 Proceedings of the Third International Symposium on Parallel and Distributed Computing/Third International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks
Cardinality estimation using sample views with quality assurance
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Efficient bulk insertion into a distributed ordered table
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Autonomic query parallelization using non-dedicated computers: an evaluation of adaptivity options
The VLDB Journal — The International Journal on Very Large Data Bases
Scheduling for Parallel Processing
Scheduling for Parallel Processing
Building a high-level dataflow system on top of Map-Reduce: the Pig experience
Proceedings of the VLDB Endowment
Hive: a warehousing solution over a map-reduce framework
Proceedings of the VLDB Endowment
Hadoop: The Definitive Guide
Skew-resistant parallel processing of feature-extracting scientific user-defined functions
Proceedings of the 1st ACM symposium on Cloud computing
Principles of Distributed Database Systems
Principles of Distributed Database Systems
A batch of PNUTS: experiences connecting cloud batch and serving systems
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Small cache, big effect: provable load balancing for randomly partitioned cluster services
Proceedings of the 2nd ACM Symposium on Cloud Computing
SkewTune: mitigating skew in mapreduce applications
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Adaptive MapReduce using situation-aware mappers
Proceedings of the 15th International Conference on Extending Database Technology
Load Balancing for MapReduce-based Entity Resolution
ICDE '12 Proceedings of the 2012 IEEE 28th International Conference on Data Engineering
Distributed data management using MapReduce
ACM Computing Surveys (CSUR)
CooMR: cross-task coordination for efficient data management in MapReduce programs
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
The elapsed time of a parallel job depends on the completion time of its longest running constituent. We present a static load balancing algorithm that distributes work evenly across the reducers in a MapReduce job resulting in significant elapsed time reductions. Taking a user-specified model of reducer performance, our load balancer uses a progressive objective-based cluster sampler to estimate the load associated with each reduce-key. It balances the workload using Key Chopping, to split keys with large loads into sub-keys that can be assigned to different distributive reducers, and Key Packing, to assign keys with medium loads to reducers to minimize the maximum reducer load. Keys with small loads are hashed as they have little effect on the balance. This repeats until the user specified balancing objective and confidence level are achieved. The sampler and load balancer have been implemented in the Oracle Loader for Hadoop (OLH), a commercial MapReduce application that employs Apache Hadoop to perform parallel data formatting and data movement into partitioned relational tables. We present the performance improvements we achieve in both OLH and in a MapReduce program for inverted index creation. The balancer works for arbitrary IID key distributions, the time used for sampling is small and our solution is very effective at reducing the elapsed time for the MapReduce jobs we explored.