Skew-resistant parallel processing of feature-extracting scientific user-defined functions

Authors:
YongChul Kwon;Magdalena Balazinska;Bill Howe;Jerome Rolia
Affiliations:
University of Washington, Seattle, WA, USA;University of Washington, Seattle, WA, USA;University of Washington, Seattle, WA, USA;HP Labs, Bristol, UNK, United Kingdom
Venue:
Proceedings of the 1st ACM symposium on Cloud computing
Year:
2010

Citing 21
Cited 17

A Partitioning Strategy for Nonuniform Problems on Multiprocessors

IEEE Transactions on Computers
Parallel database systems: the future of high performance database systems

Communications of the ACM
Adaptive parallel aggregation algorithms

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
PLUM: parallel load balancing for adaptive unstructured meshes

Journal of Parallel and Distributed Computing
Cost estimation of user-defined methods in object-relational database systems

ACM SIGMOD Record
Skew handling techniques in sort-merge join

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Zoltan Data Management Service for Parallel Dynamic Applications

Computing in Science and Engineering
Handling Data Skew in Multiprocessor Database Computers Using Partition Tuning

VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases
A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins

VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases
Practical Skew Handling in Parallel Joins

VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
AutoMate: Enabling Autonomic Applications on the Grid

Cluster Computing
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Adaptive query processing

Foundations and Trends in Databases
Handling data skew in parallel joins in shared-nothing systems

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Distributed aggregation for data-parallel computing: interfaces and implementations

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Cloud technologies for bioinformatics applications

Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
Efficient outer join data skew handling in parallel DBMS

Proceedings of the VLDB Endowment
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Towards personal high-performance geospatial computing (HPC-G): perspectives and a case study

Proceedings of the ACM SIGSPATIAL International Workshop on High Performance and Distributed Geographic Information Systems
Reining in the outliers in map-reduce clusters using Mantri

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Hybrid merge/overlap execution technique for parallel array processing

Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases
ArrayStore: a storage manager for complex parallel array processing

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Processing theta-joins using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Optimizing data partitioning for data-parallel computing

HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

Proceedings of the second international workshop on MapReduce and its applications
The case for being lazy: how to leverage lazy evaluation in MapReduce

Proceedings of the 2nd international workshop on Scientific cloud computing
SkewTune: mitigating skew in mapreduce applications

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
SkewTune in action: mitigating skew in MapReduce applications

Proceedings of the VLDB Endowment
Sailfish: a framework for large scale data processing

Proceedings of the Third ACM Symposium on Cloud Computing
Themis: an I/O-efficient MapReduce

Proceedings of the Third ACM Symposium on Cloud Computing
Balancing reducer skew in MapReduce workloads using progressive sampling

Proceedings of the Third ACM Symposium on Cloud Computing
Designing good algorithms for MapReduce and beyond

Proceedings of the Third ACM Symposium on Cloud Computing
DBalancer: distributed load balancing for NoSQL data-stores

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Mammoth: autonomic data processing framework for scientific state-transition applications

Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
Balancing reducer workload for skewed data using sampling-based partitioning

Computers and Electrical Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space. These applications exhibit significant computational skew, where the runtime of different partitions depends on more than just input size and can therefore vary dramatically and unpredictably. In this paper, we present SkewReduce, a new system implemented on top of Hadoop that enables users to easily express feature extraction analyses and execute them efficiently. At the heart of the SkewReduce system is an optimizer, parameterized by user-defined cost functions, that determines how best to partition the input data to minimize computational skew. Experiments on real data from two different science domains demonstrate that our approach can improve execution times by a factor of up to 8 compared to a naive implementation.