Distributed aggregation for data-parallel computing: interfaces and implementations

Authors:
Yuan Yu; Pradeep Kumar Gunda; Michael Isard
Affiliations:
Microsoft Research, Mountain View, CA, USA;Microsoft Research, Mountain View, CA, USA;Microsoft Research, Mountain View, CA, USA
Venue:
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Year:
2009

Citing 23
Cited 33

Encapsulation of parallelism in the Volcano query processing system

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Algorithmic skeletons: structured management of parallel computation

Algorithmic skeletons: structured management of parallel computation
Parallel database systems: the future of high performance database systems

Communications of the ACM
Query evaluation techniques for large databases

ACM Computing Surveys (CSUR)
An overview of DB2 parallel edition

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Programming parallel algorithms

Communications of the ACM
An overview of query optimization in relational systems

PODS '98 Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals

Data Mining and Knowledge Discovery
The Gamma Database Machine Project

IEEE Transactions on Knowledge and Data Engineering
The POSTGRES Data Model

VLDB '87 Proceedings of the 13th International Conference on Very Large Data Bases
Patterns and skeletons for parallel and distributed computing

Patterns and skeletons for parallel and distributed computing
Parallel and Distributed Haskells

Journal of Functional Programming
SQL:2003 has been published

ACM SIGMOD Record
Parallel SQL execution in Oracle 10g

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Interpreting the data: Parallel analysis with Sawzall

Scientific Programming - Dynamic Grids and Worldwide Computing
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Evaluating MapReduce for Multi-core and Multiprocessor Systems

HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Google's MapReduce programming model – Revisited

Science of Computer Programming
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
Distributed data-parallel computing using a high-level programming language

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Skew-resistant parallel processing of feature-extracting scientific user-defined functions

Proceedings of the 1st ACM symposium on Cloud computing
Predictable time-sharing for DryadLINQ cluster

Proceedings of the 7th international conference on Autonomic computing
Scripting the cloud with skywriting

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Scalable clustering algorithm for N-body simulations in a shared-nothing cluster

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Nectar: automatic management of data and computation in datacenters

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Piccolo: building fast, distributed programs with partitioned tables

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Mesos: a platform for fine-grained resource sharing in the data center

Proceedings of the 8th USENIX conference on Networked systems design and implementation
A platform for scalable one-pass analytics using MapReduce

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Steno: automatic optimization of declarative queries

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Adaptive, secure, and scalable distributed data outsourcing: a vision paper

Proceedings of the 2011 workshop on Dynamic distributed data-intensive applications, programming abstractions, and systems
In-situ MapReduce for log processing

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems

Proceedings of the 2nd ACM Symposium on Cloud Computing
Fay: extensible distributed tracing from kernels to clusters

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Design patterns for scientific applications in DryadLINQ CTP

Proceedings of the second international workshop on Data intensive computing in the clouds
GLADE: a scalable framework for efficient analytics

ACM SIGOPS Operating Systems Review
In-situ MapReduce for log processing

HotCloud'11 Proceedings of the 3rd USENIX conference on Hot topics in cloud computing
NaaS: network-as-a-service in the cloud

Hot-ICE'12 Proceedings of the 2nd USENIX conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services
Camdoop: exploiting in-network aggregation for big data applications

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Optimizing data shuffling in data-parallel computation by understanding user-defined functions

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Putting a "big-data" platform to good use: training kinect

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Stubby: a transformation-based optimizer for MapReduce workflows

Proceedings of the VLDB Endowment
Fay: Extensible Distributed Tracing from Kernels to Clusters

ACM Transactions on Computer Systems (TOCS)
Spotting code optimizations in data-parallel pipelines through PeriSCOPE

OSDI'12 Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation
SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce

ACM Transactions on Database Systems (TODS)
Cogset: a high performance MapReduce engine

Concurrency and Computation: Practice & Experience
Optimus: a dynamic rewriting framework for data-parallel execution plans

Proceedings of the 8th ACM European Conference on Computer Systems
Effective straggler mitigation: attack of the clones

NSDI'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Tutorial: stream processing optimizations

Proceedings of the 7th ACM international conference on Distributed event-based systems
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Dandelion: a compiler and runtime for heterogeneous systems

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Memory-efficient groupby-aggregate using compressed buffer trees

Proceedings of the 4th annual Symposium on Cloud Computing
A catalog of stream processing optimizations

ACM Computing Surveys (CSUR)

Abstract

Data-intensive applications are increasingly designed to execute on large computing clusters. Grouped aggregation is a core primitive of many distributed programming models, and it is often the most efficient available mechanism for computations such as matrix multiplication and graph traversal. Such algorithms typically require non-standard aggregations that are more sophisticated than traditional built-in database functions such as Sum and Max. As a result, the ease of programming user-defined aggregations and the efficiency of their implementation are of great current interest. This paper evaluates the interfaces and implementations for user-defined aggregation in several state-of-the-art distributed computing systems: Hadoop, databases such as Oracle Parallel Server, and DryadLINQ. We show that the degree of language integration between user-defined functions and the high-level query language has an impact on code legibility and simplicity; that the choice of programming interface has a material effect on the performance of computations; that some execution plans perform better than others on average; and that, in order to get good performance on a variety of workloads, a system must be able to select between execution plans depending on the computation. The interface and execution plan described in the MapReduce paper, and implemented by Hadoop, are found to be among the worst-performing choices.
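To make the idea of a user-defined grouped aggregation concrete, the sketch below computes a per-key average over partitioned data in plain Python. It is only an illustration under stated assumptions: the function names (`initial`, `combine`, `finalize`, `partial_aggregate`, `grouped_aggregate`) and the three-part split of the aggregation are hypothetical choices made here, not the interface of Hadoop, Oracle Parallel Server, or DryadLINQ, and the "partitions" are ordinary in-memory lists standing in for data held on different machines.

```python
# Hypothetical decomposable user-defined aggregation: per-key average.
# All names below are illustrative assumptions, not any system's real API.

def initial(value):
    """Turn one record's value into a partial aggregate (sum, count)."""
    return (value, 1)

def combine(a, b):
    """Merge two partial aggregates; the result does not depend on order."""
    return (a[0] + b[0], a[1] + b[1])

def finalize(partial):
    """Turn a fully merged partial aggregate into the final average."""
    total, count = partial
    return total / count

def partial_aggregate(records):
    """Run on one partition: group by key and fold values into partials."""
    partials = {}
    for key, value in records:
        p = initial(value)
        partials[key] = combine(partials[key], p) if key in partials else p
    return partials

def grouped_aggregate(partitions):
    """Merge the per-partition partials by key, then finalize each group."""
    merged = {}
    for partition in partitions:
        for key, p in partial_aggregate(partition).items():
            merged[key] = combine(merged[key], p) if key in merged else p
    return {key: finalize(p) for key, p in merged.items()}

# Two in-memory lists stand in for data stored on two machines.
partitions = [
    [("a", 1.0), ("b", 4.0), ("a", 3.0)],
    [("a", 5.0), ("b", 6.0)],
]
result = grouped_aggregate(partitions)
print(result)
```

In this sketch only the small `(sum, count)` pairs cross partition boundaries, rather than every record. Whether a real system can exploit that kind of decomposition is one example of how the choice of programming interface can affect performance.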