Adaptive MapReduce using situation-aware mappers

Authors:
Rares Vernica;Andrey Balmin;Kevin S. Beyer;Vuk Ercegovac
Affiliations:
Hewlett-Packard Laboratories, Palo Alto, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
Proceedings of the 15th International Conference on Extending Database Technology
Year:
2012

Citing 28
Cited 13

A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment

SIGMOD '89 Proceedings of the 1989 ACM SIGMOD international conference on Management of data
Adaptive parallel aggregation algorithms

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Parallel sorting on a shared-nothing architecture using probabilistic splitting

PDIS '91 Proceedings of the first international conference on Parallel and distributed information systems
New Algorithms for Parallelizing Relational Database Joins in the Presence of Data Skew

IEEE Transactions on Knowledge and Data Engineering
Practical Skew Handling in Parallel Joins

VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Adaptive aggregation on chip multiprocessors

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Automatic optimization of parallel dataflow programs

ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
MapReduce: a flexible data processing tool

Communications of the ACM - Amir Pnueli: Ahead of His Time
Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling

Proceedings of the 5th European conference on Computer systems
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Proceedings of the 1st ACM symposium on Cloud computing
Towards automatic optimization of MapReduce programs

Proceedings of the 1st ACM symposium on Cloud computing
Efficient parallel set-similarity joins using MapReduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
A comparison of join algorithms for log processing in MaPreduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
ZooKeeper: wait-free coordination for internet-scale systems

USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Spark: cluster computing with working sets

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
HaLoop: efficient iterative data processing on large clusters

Proceedings of the VLDB Endowment
The performance of MapReduce: an in-depth study

Proceedings of the VLDB Endowment
MRShare: sharing across multiple queries in MapReduce

Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
CIEL: a universal execution engine for distributed data-flow computing

Proceedings of the 8th USENIX conference on Networked systems design and implementation
A platform for scalable one-pass analytics using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Hyracks: A flexible and extensible foundation for data-intensive computing

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
FLEX: a slot allocation scheduling optimizer for MapReduce workloads

Proceedings of the ACM/IFIP/USENIX 11th International Conference on Middleware

Clydesdale: structured data processing on hadoop

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
SkewTune in action: mitigating skew in MapReduce applications

Proceedings of the VLDB Endowment
On the optimization of schedules for MapReduce workloads in the presence of shared scans

The VLDB Journal — The International Journal on Very Large Data Bases
Sailfish: a framework for large scale data processing

Proceedings of the Third ACM Symposium on Cloud Computing
Balancing reducer skew in MapReduce workloads using progressive sampling

Proceedings of the Third ACM Symposium on Cloud Computing
Designing good algorithms for MapReduce and beyond

Proceedings of the Third ACM Symposium on Cloud Computing
Minimal MapReduce algorithms

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Bisimulation reduction of big graphs on mapreduce

BNCOD'13 Proceedings of the 29th British National conference on Big Data
Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads

Proceedings of the VLDB Endowment
Next generation data analytics at IBM research

Proceedings of the VLDB Endowment
Run-time performance optimization of a BigData query language

Proceedings of the 5th ACM/SPEC international conference on Performance engineering
SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

Journal of Parallel and Distributed Computing
A platform for eXtreme analytics

IBM Journal of Research and Development

Quantified Score

Hi-index	0.00

Visualization

Abstract

We propose new adaptive runtime techniques for MapReduce that improve performance and simplify job tuning. We implement these techniques by breaking a key assumption of MapReduce that mappers run in isolation. Instead, our mappers communicate through a distributed meta-data store and are aware of the global state of the job. However, we still preserve the fault-tolerance, scalability, and programming API of MapReduce. We utilize these "situation-aware mappers" to develop a set of techniques that make MapReduce more dynamic: (a) Adaptive Mappers dynamically take multiple data partitions (splits) to amortize mapper start-up costs; (b) Adaptive Combiners improve local aggregation by maintaining a cache of partial aggregates for the frequent keys; (c) Adaptive Sampling and Partitioning sample the mapper outputs and use the obtained statistics to produce balanced partitions for the reducers. Our experimental evaluation shows that adaptive techniques provide up to 3x performance improvement, in some cases, and dramatically improve performance stability across the board.