Opening the black boxes in data flow optimization

Authors:
Fabian Hueske;Mathias Peters;Matthias J. Sax;Astrid Rheinländer;Rico Bergmann;Aljoscha Krettek;Kostas Tzoumas
Affiliations:
Technische Universität Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Technische Universität Berlin, Germany;Technische Universität Berlin, Germany
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join-order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of nonrelational data flows, a unique feature among today's systems.
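
The idea that a few operator properties can justify a reordering can be illustrated with a minimal sketch. The abstract does not name the properties, so this example assumes they are the sets of record fields an operator reads and writes; the `OpProps` class, the `can_reorder` function, and the example operators are all hypothetical. The static code analysis that would derive these sets from a user-defined function is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpProps:
    """Hypothetical summary of a black-box operator (assumed, not from the paper)."""
    name: str
    reads: frozenset   # fields the user-defined function reads
    writes: frozenset  # fields the user-defined function modifies or adds


def can_reorder(a: OpProps, b: OpProps) -> bool:
    """Conservative check for two record-at-a-time operators.

    Swapping is treated as safe only if neither operator writes a field
    that the other one reads or writes.
    """
    return (
        a.writes.isdisjoint(b.reads)
        and b.writes.isdisjoint(a.reads)
        and a.writes.isdisjoint(b.writes)
    )


# Illustrative operators: two filters and one that derives a new field.
filter_price = OpProps("filter_price", frozenset({"price"}), frozenset())
tag_region = OpProps("tag_region", frozenset({"zip"}), frozenset({"region"}))
filter_region = OpProps("filter_region", frozenset({"region"}), frozenset())

print(can_reorder(filter_price, tag_region))   # True: no shared fields
print(can_reorder(tag_region, filter_region))  # False: filter reads a written field
```

The check is deliberately conservative: when the field sets overlap it refuses the swap, which loses optimization opportunities but does not change results. Operators that group or join records would likely need additional conditions that this sketch does not cover.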