Encapsulation of parallelism in the Volcano query processing system
SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Parallel database systems: the future of high performance database systems
Communications of the ACM
Fundamental techniques for order optimization
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Query Processing in Parallel Relational Database Systems
Query Processing in Parallel Relational Database Systems
Access path selection in a relational database management system
SIGMOD '79 Proceedings of the 1979 ACM SIGMOD international conference on Management of data
IEEE Transactions on Knowledge and Data Engineering
The Volcano Optimizer Generator: Extensibility and Efficient Search
Proceedings of the Ninth International Conference on Data Engineering
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
An Efficient Framework for Order Optimization
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Avoiding sorting and grouping in processing queries
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
A combined framework for grouping and order optimization
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets
Proceedings of the VLDB Endowment
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Dremel: interactive analysis of web-scale datasets
Proceedings of the VLDB Endowment
The data partition strategy based on hybrid range consistent hash in NoSQL database
Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service
Hi-index | 0.00 |
An increasing number of companies rely on distributed data storage and processing over large clusters of commodity machines for critical business decisions. Although plain MapReduce systems provide several benefits, they carry certain limitations that impact developer productivity and optimization opportunities. Higher level programming languages plus conceptual data models have recently emerged to address such limitations. These languages offer a single machine programming abstraction and are able to perform sophisticated query optimization and apply efficient execution strategies. In massively distributed computation, data shuffling is typically the most expensive operation and can lead to serious performance bottlenecks if not done properly. An important optimization opportunity in this environment is that of judicious placement of repartitioning operators and choice of alternative implementations. In this paper we discuss advanced partitioning strategies, their implementation, and how they are integrated in the Microsoft Scope system. We show experimentally that our approach significantly improves performance for a large class of real-world jobs.