Optimizing data partitioning for data-parallel computing

Authors:
Qifa Ke;Vijayan Prabhakaran;Yinglian Xie;Yuan Yu;Jingyue Wu;Junfeng Yang
Affiliations:
Microsoft Research Silicon Valley;Microsoft Research Silicon Valley;Microsoft Research Silicon Valley;Microsoft Research Silicon Valley;Columbia University;Columbia University
Venue:
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Year:
2011

Abstract

The performance of data-parallel computing systems (e.g., MapReduce, DryadLINQ) depends heavily on how their data is partitioned. The solutions implemented in current state-of-the-art systems are far from optimal. Techniques proposed by the database community for finding optimal data partitions are not directly applicable when complex user-defined functions and data models are involved. We outline our solution, which draws on expertise from fields such as programming languages and optimization, and present our preliminary results.