Automatic optimization of parallel dataflow programs

Authors:
Christopher Olston;Benjamin Reed;Adam Silberstein;Utkarsh Srivastava
Affiliations:
Yahoo! Research;Yahoo! Research;Yahoo! Research;Yahoo! Research
Venue:
ATC'08 USENIX 2008 Annual Technical Conference on Annual Technical Conference
Year:
2008

Citing 13
Cited 19

Query optimization

ACM Computing Surveys (CSUR)
Eddies: continuously adaptive query processing

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Database Management Systems

Database Management Systems
Run-time adaptation in river

ACM Transactions on Computer Systems (TOCS)
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Data Mining and Knowledge Discovery
Automated Selection of Materialized Views and Indexes in SQL Databases

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Data placement in shared-nothing parallel database systems

The VLDB Journal — The International Journal on Very Large Data Bases
Interpreting the data: Parallel analysis with Sawzall

Scientific Programming - Dynamic Grids and Worldwide Computing
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Adaptive query processing

Foundations and Trends in Databases
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Diamond: a storage architecture for early discard in interactive search

FAST'04 Proceedings of the 3rd USENIX conference on File and storage technologies

MapReduce optimization using regulated dynamic prioritization

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Comet: batched stream processing for data intensive distributed computing

Proceedings of the 1st ACM symposium on Cloud computing
Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Proceedings of the 1st ACM symposium on Cloud computing
Towards automatic optimization of MapReduce programs

Proceedings of the 1st ACM symposium on Cloud computing
A comparison of join algorithms for log processing in MaPreduce

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Wave computing in the cloud

HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Scripting the cloud with skywriting

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
MRShare: sharing across multiple queries in MapReduce

Proceedings of the VLDB Endowment
Nectar: automatic management of data and computation in datacenters

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Column-oriented storage techniques for MapReduce

Proceedings of the VLDB Endowment
Processing theta-joins using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Llama: leveraging columnar storage for scalable join processing in the MapReduce framework

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
CoHadoop: flexible data placement and its exploitation in Hadoop

Proceedings of the VLDB Endowment
Adaptive MapReduce using situation-aware mappers

Proceedings of the 15th International Conference on Extending Database Technology
Stubby: a transformation-based optimizer for MapReduce workflows

Proceedings of the VLDB Endowment
Spotting code optimizations in data-parallel pipelines through PeriSCOPE

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Just-in-time data distribution for analytical query processing

ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
Modeling and optimizing large-scale data flows

Future Generation Computer Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Large-scale parallel dataflow systems, e.g., Dryad and Map-Reduce, have attracted significant attention recently. High-level dataflow languages such as Pig Latin and Sawzall are being layered on top of these systems, to enable faster program development and more maintainable code. These languages engender greater transparency in program structure, and open up opportunities for automatic optimization. This paper proposes a set of optimization strategies for this context, drawing on and extending techniques from the database community.