The EXODUS optimizer generator
SIGMOD '87 Proceedings of the 1987 ACM SIGMOD international conference on Management of data
Querying Multiple Features of Groups in Relational Databases
VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
A recursive random search algorithm for large-scale network parameter configuration
SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Optimizing ETL Processes in Data Warehouses
ICDE '05 Proceedings of the 21st International Conference on Data Engineering
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Automatic optimization of parallel dataflow programs
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
A comparison of approaches to large-scale data analysis
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Distributed aggregation for data-parallel computing: interfaces and implementations
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Building a high-level dataflow system on top of Map-Reduce: the Pig experience
Proceedings of the VLDB Endowment
FlumeJava: easy, efficient data-parallel pipelines
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
MRShare: sharing across multiple queries in MapReduce
Proceedings of the VLDB Endowment
Automatic optimization for MapReduce programs
Proceedings of the VLDB Endowment
A latency and fault-tolerance optimizer for online parallel query plans
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
YSmart: Yet Another SQL-to-MapReduce Translator
ICDCS '11 Proceedings of the 2011 31st International Conference on Distributed Computing Systems
CoScan: cooperative scan sharing in the cloud
Proceedings of the 2nd ACM Symposium on Cloud Computing
Query optimization for massively parallel data processing
Proceedings of the 2nd ACM Symposium on Cloud Computing
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)
Continuous cloud-scale query optimization and processing
Proceedings of the VLDB Endowment
Hi-index | 0.00 |
There is a growing trend of performing analysis on large datasets using workflows composed of MapReduce jobs connected through producer-consumer relationships based on data. This trend has spurred the development of a number of interfaces---ranging from program-based to query-based interfaces---for generating MapReduce workflows. Studies have shown that the gap in performance can be quite large between optimized and unoptimized workflows. However, automatic cost-based optimization of MapReduce workflows remains a challenge due to the multitude of interfaces, large size of the execution plan space, and the frequent unavailability of all types of information needed for optimization. We introduce a comprehensive plan space for MapReduce workflows generated by popular workflow generators. We then propose Stubby, a cost-based optimizer that searches selectively through the subspace of the full plan space that can be enumerated correctly and costed based on the information available in any given setting. Stubby enumerates the plan space based on plan-to-plan transformations and an efficient search algorithm. Stubby is designed to be extensible to new interfaces and new types of optimizations, which is a desirable feature given how rapidly MapReduce systems are evolving. Stubby's efficiency and effectiveness have been evaluated using representative workflows from many domains.