SkewTune: mitigating skew in mapreduce applications

Authors:
YongChul Kwon;Magdalena Balazinska;Bill Howe;Jerome Rolia
Affiliations:
University of Washington, Seattle, WA, USA;University of Washington, Seattle, WA, USA;University of Washington, Seattle, WA, USA;HP Labs, Palo Alto, CA, USA
Venue:
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Year:
2012

Citing 25
Cited 20

Adaptive parallel aggregation algorithms

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Online aggregation

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins

VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases
Practical Skew Handling in Parallel Joins

VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Highly available, fault-tolerant, parallel dataflows

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Clustera: an integrated computation and data management system

Proceedings of the VLDB Endowment
CloudBurst

Bioinformatics
Efficient outer join data skew handling in parallel DBMS

Proceedings of the VLDB Endowment
Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Proceedings of the VLDB Endowment
Skew-resistant parallel processing of feature-extracting scientific user-defined functions

Proceedings of the 1st ACM symposium on Cloud computing
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Proceedings of the 1st ACM symposium on Cloud computing
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Improving MapReduce performance in heterogeneous environments

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Scalable clustering algorithm for N-body simulations in a shared-nothing cluster

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
HaLoop: efficient iterative data processing on large clusters

Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)

Proceedings of the VLDB Endowment
Reining in the outliers in map-reduce clusters using Mantri

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud

CLOUDCOM '10 Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science
Parallel evaluation of conjunctive queries

Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A latency and fault-tolerance optimizer for online parallel query plans

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A platform for scalable one-pass analytics using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Designing good MapReduce algorithms

XRDS: Crossroads, The ACM Magazine for Students - Big Data
SkewTune in action: mitigating skew in MapReduce applications

Proceedings of the VLDB Endowment
Themis: an I/O-efficient MapReduce

Proceedings of the Third ACM Symposium on Cloud Computing
Balancing reducer skew in MapReduce workloads using progressive sampling

Proceedings of the Third ACM Symposium on Cloud Computing
Designing good algorithms for MapReduce and beyond

Proceedings of the Third ACM Symposium on Cloud Computing
Towards building a high performance spatial query system for large scale medical imaging data

Proceedings of the 20th International Conference on Advances in Geographic Information Systems
Minimal MapReduce algorithms

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Shark: SQL and rich analytics at scale

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Fast data in the era of big data: Twitter's real-time related query suggestion architecture

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Optimus: a dynamic rewriting framework for data-parallel execution plans

Proceedings of the 8th ACM European Conference on Computer Systems
A case for dynamic memory partitioning in data centers

Proceedings of the Second Workshop on Data Analytics in the Cloud
Upper and lower bounds on the cost of a map-reduce computation

Proceedings of the VLDB Endowment
The case for tiny tasks in compute clusters

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Distributed data management using MapReduce

ACM Computing Surveys (CSUR)
SIDR: structure-aware intelligent data routing in Hadoop

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads

Proceedings of the VLDB Endowment
Scalable progressive analytics on big data in the cloud

Proceedings of the VLDB Endowment
GLB: lifeline-based global load balancing library in x10

Proceedings of the first workshop on Parallel programming for analytics applications
Parallel labeling of massive XML data with MapReduce

The Journal of Supercomputing
Balancing reducer workload for skewed data using sampling-based partitioning

Computers and Electrical Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present an automatic skew mitigation approach for user-defined MapReduce programs and present SkewTune, a system that implements this approach as a drop-in replacement for an existing MapReduce implementation. There are three key challenges: (a) require no extra input from the user yet work for all MapReduce applications, (b) be completely transparent, and (c) impose minimal overhead if there is no skew. The SkewTune approach addresses these challenges and works as follows: When a node in the cluster becomes idle, SkewTune identifies the task with the greatest expected remaining processing time. The unprocessed input data of this straggling task is then proactively repartitioned in a way that fully utilizes the nodes in the cluster and preserves the ordering of the input data so that the original output can be reconstructed by concatenation. We implement SkewTune as an extension to Hadoop and evaluate its effectiveness using several real applications. The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.