Optimization techniques for queries with expensive methods
ACM Transactions on Database Systems (TODS)
Optimization of queries with user-defined predicates
ACM Transactions on Database Systems (TODS)
Access path selection in a relational database management system
SIGMOD '79 Proceedings of the 1979 ACM SIGMOD international conference on Management of data
Including Group-By in Query Optimization
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Compilers: Principles, Techniques, and Tools (2nd Edition)
Compilers: Principles, Techniques, and Tools (2nd Edition)
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Dynamic programming strikes back
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets
Proceedings of the VLDB Endowment
FERRY: database-supported program execution
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Nephele: efficient parallel data processing in the cloud
Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
Proceedings of the VLDB Endowment
Hive: a warehousing solution over a map-reduce framework
Proceedings of the VLDB Endowment
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing
Proceedings of the 1st ACM symposium on Cloud computing
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
ASTERIX: towards a scalable, semistructured data platform for evolving-world models
Distributed and Parallel Databases
Automatic optimization for MapReduce programs
Proceedings of the VLDB Endowment
A new, highly efficient, and easy to implement top-down join enumeration algorithm
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Hyracks: A flexible and extensible foundation for data-intensive computing
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Spinning fast iterative data flows
Proceedings of the VLDB Endowment
Iterative parallel data processing with stratosphere: an inside look
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Rhea: automatic filtering for unstructured cloud storage
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Tutorial: stream processing optimizations
Proceedings of the 7th ACM international conference on Distributed event-based systems
A catalog of stream processing optimizations
ACM Computing Surveys (CSUR)
Hi-index | 0.00 |
Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join-order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of nonrelational data flows, a unique feature among today's systems.