Adaptive parallel aggregation algorithms
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
The anatomy of a large-scale hypertextual Web search engine
WWW7 Proceedings of the seventh international conference on World Wide Web 7
A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins
VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases
Practical Skew Handling in Parallel Joins
VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing
VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Highly available, fault-tolerant, parallel dataflows
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Clustera: an integrated computation and data management system
Proceedings of the VLDB Endowment
Bioinformatics
Efficient outer join data skew handling in parallel DBMS
Proceedings of the VLDB Endowment
Building a high-level dataflow system on top of Map-Reduce: the Pig experience
Proceedings of the VLDB Endowment
Skew-resistant parallel processing of feature-extracting scientific user-defined functions
Proceedings of the 1st ACM symposium on Cloud computing
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing
Proceedings of the 1st ACM symposium on Cloud computing
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Improving MapReduce performance in heterogeneous environments
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Scalable clustering algorithm for N-body simulations in a shared-nothing cluster
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
HaLoop: efficient iterative data processing on large clusters
Proceedings of the VLDB Endowment
Hadoop++: making a yellow elephant run like a cheetah (without it even noticing)
Proceedings of the VLDB Endowment
Reining in the outliers in map-reduce clusters using Mantri
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud
CLOUDCOM '10 Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science
Parallel evaluation of conjunctive queries
Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A latency and fault-tolerance optimizer for online parallel query plans
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
A platform for scalable one-pass analytics using MapReduce
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Designing good MapReduce algorithms
XRDS: Crossroads, The ACM Magazine for Students - Big Data
SkewTune in action: mitigating skew in MapReduce applications
Proceedings of the VLDB Endowment
Themis: an I/O-efficient MapReduce
Proceedings of the Third ACM Symposium on Cloud Computing
Balancing reducer skew in MapReduce workloads using progressive sampling
Proceedings of the Third ACM Symposium on Cloud Computing
Designing good algorithms for MapReduce and beyond
Proceedings of the Third ACM Symposium on Cloud Computing
Towards building a high performance spatial query system for large scale medical imaging data
Proceedings of the 20th International Conference on Advances in Geographic Information Systems
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Shark: SQL and rich analytics at scale
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Fast data in the era of big data: Twitter's real-time related query suggestion architecture
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Optimus: a dynamic rewriting framework for data-parallel execution plans
Proceedings of the 8th ACM European Conference on Computer Systems
A case for dynamic memory partitioning in data centers
Proceedings of the Second Workshop on Data Analytics in the Cloud
Upper and lower bounds on the cost of a map-reduce computation
Proceedings of the VLDB Endowment
The case for tiny tasks in compute clusters
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Distributed data management using MapReduce
ACM Computing Surveys (CSUR)
SIDR: structure-aware intelligent data routing in Hadoop
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hadoop's adolescence: an analysis of Hadoop usage in scientific workloads
Proceedings of the VLDB Endowment
Scalable progressive analytics on big data in the cloud
Proceedings of the VLDB Endowment
GLB: lifeline-based global load balancing library in x10
Proceedings of the first workshop on Parallel programming for analytics applications
Parallel labeling of massive XML data with MapReduce
The Journal of Supercomputing
Balancing reducer workload for skewed data using sampling-based partitioning
Computers and Electrical Engineering
Hi-index | 0.00 |
We present an automatic skew mitigation approach for user-defined MapReduce programs and present SkewTune, a system that implements this approach as a drop-in replacement for an existing MapReduce implementation. There are three key challenges: (a) require no extra input from the user yet work for all MapReduce applications, (b) be completely transparent, and (c) impose minimal overhead if there is no skew. The SkewTune approach addresses these challenges and works as follows: When a node in the cluster becomes idle, SkewTune identifies the task with the greatest expected remaining processing time. The unprocessed input data of this straggling task is then proactively repartitioned in a way that fully utilizes the nodes in the cluster and preserves the ordering of the input data so that the original output can be reconstructed by concatenation. We implement SkewTune as an extension to Hadoop and evaluate its effectiveness using several real applications. The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.