Stubby: a transformation-based optimizer for MapReduce workflows

Authors: Harold Lim; Herodotos Herodotou; Shivnath Babu
Affiliations: Duke University (all authors)
Venue: Proceedings of the VLDB Endowment
Year: 2012

Abstract

Large datasets are increasingly analyzed using workflows composed of MapReduce jobs connected through data-based producer-consumer relationships. This trend has spurred the development of a number of interfaces, ranging from program-based to query-based, for generating MapReduce workflows. Studies have shown that the performance gap between optimized and unoptimized workflows can be quite large. However, automatic cost-based optimization of MapReduce workflows remains a challenge because of the multitude of interfaces, the large size of the execution plan space, and the frequent unavailability of all types of information needed for optimization. We introduce a comprehensive plan space for MapReduce workflows generated by popular workflow generators. We then propose Stubby, a cost-based optimizer that searches selectively through the subspace of the full plan space that can be enumerated correctly and costed based on the information available in any given setting. Stubby enumerates the plan space using plan-to-plan transformations and an efficient search algorithm. Stubby is designed to be extensible to new interfaces and new types of optimizations, a desirable feature given how rapidly MapReduce systems are evolving. Stubby's efficiency and effectiveness have been evaluated using representative workflows from many domains.
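
To make the idea of transformation-based, cost-based search concrete, here is a toy sketch in Python. It is not Stubby's algorithm: the abstract does not specify Stubby's transformations, cost model, or search strategy, so everything below is a simplifying assumption made for illustration. That includes the `Job` fields, the linear-chain plan shape, the single `merge_map_only` transformation, the `startup`/`per_gb` cost constants, and the greedy search loop.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple


@dataclass(frozen=True)
class Job:
    """Hypothetical MapReduce job description (illustrative only)."""
    name: str
    input_gb: float   # data the job reads
    map_only: bool    # True if the job has no reduce phase


# A plan is a linear chain: each job consumes its predecessor's output.
Plan = Tuple[Job, ...]
Transformation = Callable[[Plan], Iterator[Plan]]


def cost(plan: Plan, startup: float = 10.0, per_gb: float = 1.0) -> float:
    """Toy cost model: fixed startup cost per job plus a per-GB read cost."""
    return sum(startup + per_gb * job.input_gb for job in plan)


def merge_map_only(plan: Plan) -> Iterator[Plan]:
    """Plan-to-plan transformation: fold a map-only producer into its consumer.

    The merged job reads the producer's input, so the intermediate
    dataset between the two jobs is never written or re-read.
    """
    for i in range(len(plan) - 1):
        producer, consumer = plan[i], plan[i + 1]
        if producer.map_only:
            merged = Job(
                name=f"{producer.name}+{consumer.name}",
                input_gb=producer.input_gb,
                map_only=consumer.map_only,
            )
            yield plan[:i] + (merged,) + plan[i + 2:]


def optimize(plan: Plan, transformations: List[Transformation]) -> Plan:
    """Greedy search: repeatedly move to the cheapest neighboring plan."""
    best = plan
    while True:
        candidates = [c for t in transformations for c in t(best)]
        if not candidates:
            return best
        cheapest = min(candidates, key=cost)
        if cost(cheapest) >= cost(best):
            return best
        best = cheapest


workflow: Plan = (
    Job("filter", 100.0, True),
    Job("project", 40.0, True),
    Job("aggregate", 20.0, False),
)
optimized = optimize(workflow, [merge_map_only])
print(cost(workflow), [job.name for job in workflow])
print(cost(optimized), [job.name for job in optimized])
```

New kinds of optimization can be added by appending more transformation functions to the list passed to `optimize`, which loosely mirrors the extensibility goal stated in the abstract. A greedy loop like this can stall in a local optimum, so a realistic optimizer would need a more careful search.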