Advanced partitioning techniques for massively distributed computation

Authors:
Jingren Zhou;Nicolas Bruno;Wei Lin
Affiliations:
Microsoft, Redmond, WA, USA;Microsoft, Redmond, WA, USA;Microsoft, Redmond, WA, USA
Venue:
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Year:
2012

Citing 17
Cited 1

Encapsulation of parallelism in the Volcano query processing system

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Parallel database systems: the future of high performance database systems

Communications of the ACM
Fundamental techniques for order optimization

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Query Processing in Parallel Relational Database Systems

Query Processing in Parallel Relational Database Systems
Access path selection in a relational database management system

SIGMOD '79 Proceedings of the 1979 ACM SIGMOD international conference on Management of data
Sort vs. Hash Revisited

IEEE Transactions on Knowledge and Data Engineering
The Volcano Optimizer Generator: Extensibility and Efficient Search

Proceedings of the Ninth International Conference on Data Engineering
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
An Efficient Framework for Order Optimization

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Avoiding sorting and grouping in processing queries

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
A combined framework for grouping and order optimization

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Dremel: interactive analysis of web-scale datasets

Proceedings of the VLDB Endowment

The data partition strategy based on hybrid range consistent hash in NoSQL database

Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service

Quantified Score

Hi-index	0.00

Visualization

Abstract

An increasing number of companies rely on distributed data storage and processing over large clusters of commodity machines for critical business decisions. Although plain MapReduce systems provide several benefits, they carry certain limitations that impact developer productivity and optimization opportunities. Higher level programming languages plus conceptual data models have recently emerged to address such limitations. These languages offer a single machine programming abstraction and are able to perform sophisticated query optimization and apply efficient execution strategies. In massively distributed computation, data shuffling is typically the most expensive operation and can lead to serious performance bottlenecks if not done properly. An important optimization opportunity in this environment is that of judicious placement of repartitioning operators and choice of alternative implementations. In this paper we discuss advanced partitioning strategies, their implementation, and how they are integrated in the Microsoft Scope system. We show experimentally that our approach significantly improves performance for a large class of real-world jobs.