MULTILISP: a language for concurrent symbolic computation
ACM Transactions on Programming Languages and Systems (TOPLAS)
Compilers: principles, techniques, and tools
Compilers: principles, techniques, and tools
Fundamental techniques for order optimization
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
ACM Computing Surveys (CSUR)
Advanced compiler design and implementation
Advanced compiler design and implementation
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Access path selection in a relational database management system
SIGMOD '79 Proceedings of the 1979 ACM SIGMOD international conference on Management of data
Practical Skew Handling in Parallel Joins
VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
ICSE '81 Proceedings of the 5th international conference on Software engineering
An Efficient Framework for Order Optimization
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Interpreting the data: Parallel analysis with Sawzall
Scientific Programming - Dynamic Grids and Worldwide Computing
Map-reduce-merge: simplified relational data processing on large clusters
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Avoiding sorting and grouping in processing queries
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
A combined framework for grouping and order optimization
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets
Proceedings of the VLDB Endowment
Experiments with a symbolic evaluation system
AFIPS '76 Proceedings of the June 7-10, 1976, national computer conference and exposition
A comparison of approaches to large-scale data analysis
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Inferring Dataflow Properties of User Defined Table Processors
SAS '09 Proceedings of the 16th International Symposium on Static Analysis
Distributed aggregation for data-parallel computing: interfaces and implementations
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Hive: a warehousing solution over a map-reduce framework
Proceedings of the VLDB Endowment
State Joining and Splitting for the Symbolic Execution of Binaries
Runtime Verification
On the relative completeness of bytecode analysis versus source code analysis
CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction
Comet: batched stream processing for data intensive distributed computing
Proceedings of the 1st ACM symposium on Cloud computing
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Distributed threshold querying of general functions by a difference of monotonic representation
Proceedings of the VLDB Endowment
Reining in the outliers in map-reduce clusters using Mantri
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Scarlett: coping with skewed content popularity in mapreduce clusters
Proceedings of the sixth conference on Computer systems
CIEL: a universal execution engine for distributed data-flow computing
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Automatic optimization for MapReduce programs
Proceedings of the VLDB Endowment
Optimizing data partitioning for data-parallel computing
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Steno: automatic optimization of declarative queries
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Query optimization for massively parallel data processing
Proceedings of the 2nd ACM Symposium on Cloud Computing
Small cache, big effect: provable load balancing for randomly partitioned cluster services
Proceedings of the 2nd ACM Symposium on Cloud Computing
Jockey: guaranteed job latency in data parallel clusters
Proceedings of the 7th ACM european conference on Computer Systems
Re-optimizing data-parallel computing
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Spotting code optimizations in data-parallel pipelines through PeriSCOPE
OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
Coflow: a networking abstraction for cluster applications
Proceedings of the 11th ACM Workshop on Hot Topics in Networks
Optimus: a dynamic rewriting framework for data-parallel execution plans
Proceedings of the 8th ACM European Conference on Computer Systems
HadoopProv: towards provenance as a first class citizen in MapReduce
TaPP'13 Proceedings of the 5th USENIX conference on Theory and Practice of Provenance
Rhea: automatic filtering for unstructured cloud storage
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
HadoopProv: towards provenance as a first class citizen in MapReduce
Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance
Hi-index | 0.00 |
Map/Reduce style data-parallel computation is characterized by the extensive use of user-defined functions for data processing and relies on data-shuffling stages to prepare data partitions for parallel computation. Instead of treating user-defined functions as "black boxes", we propose to analyze those functions to turn them into "gray boxes" that expose opportunities to optimize data shuffling. We identify useful functional properties for user-defined functions, and propose SUDO, an optimization framework that reasons about data-partition properties, functional properties, and data shuffling. We have assessed this optimization opportunity on over 10,000 data-parallel programs used in production SCOPE clusters, and designed a framework that is incorporated it into the production system. Experiments with real SCOPE programs on real production data have shown that this optimization can save up to 47% in terms of disk and network I/O for shuffling, and up to 48% in terms of cross-pod network traffic.