Distributed data-parallel computing using a high-level programming language

Authors:
Michael Isard;Yuan Yu
Affiliations:
Microsoft Research, Mountain View, CA, USA;Microsoft Research, Mountain View, CA, USA
Venue:
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Year:
2009

Citing 12
Cited 20

Encapsulation of parallelism in the Volcano query processing system

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Parallel database systems: the future of high performance database systems

Communications of the ACM
Prototyping Bubba, A Highly Parallel Database System

IEEE Transactions on Knowledge and Data Engineering
The Gamma Database Machine Project

IEEE Transactions on Knowledge and Data Engineering
Parallel SQL execution in Oracle 10g

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
A query language for data parallel programming: invited talk

Proceedings of the 2007 workshop on Declarative aspects of multicore programming
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Pig latin: a not-so-foreign language for data processing

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
Architecture of a Database System

Foundations and Trends in Databases
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Distributed aggregation for data-parallel computing: interfaces and implementations

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
MRAP: a novel MapReduce-based framework to support HPC analytics applications with access patterns

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Weaver: integrating distributed computing abstractions into scientific workflows using Python

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
A domain-specific approach to heterogeneous parallelism

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
GBASE: a scalable and general graph management system

Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Parallel data processing with MapReduce: a survey

ACM SIGMOD Record
Provenance for MapReduce-based data-intensive workflows

Proceedings of the 6th workshop on Workflows in support of large-scale science
GLADE: a scalable framework for efficient analytics

ACM SIGOPS Operating Systems Review
GLADE: big data analytics made easy

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Clydesdale: structured data processing on MapReduce

Proceedings of the 15th International Conference on Extending Database Technology
An optimization framework for map-reduce queries

Proceedings of the 15th International Conference on Extending Database Technology
Putting a "big-data" platform to good use: training kinect

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Using Pig as a data preparation language for large-scale mining software repositories studies: An experience report

Journal of Systems and Software
gbase: an efficient analysis platform for large graphs

The VLDB Journal — The International Journal on Very Large Data Bases
Scripting distributed scientific workflows using Weaver

Concurrency and Computation: Practice & Experience
Shark: SQL and rich analytics at scale

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
SIDR: structure-aware intelligent data routing in Hadoop

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Turning nondeterminism into parallelism

Proceedings of the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages & applications
Composition and reuse with compiled domain-specific languages

ECOOP'13 Proceedings of the 27th European conference on Object-Oriented Programming
Piranha: optimizing short jobs in Hadoop

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

The Dryad and DryadLINQ systems offer a new programming model for large scale data-parallel computing. They generalize previous execution environments such as SQL and MapReduce in three ways: by providing a general-purpose distributed execution engine for data-parallel applications; by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language. A DryadLINQ program is a sequential program composed of LINQ expressions performing arbitrary side-effect-free operations on datasets, and can be written and debugged using standard .NET development tools. The DryadLINQ system automatically and transparently translates the data-parallel portions of the program into a distributed execution plan which is passed to the Dryad execution platform. Dryad, which has been in continuous operation for several years on production clusters made up of thousands of computers, ensures efficient, reliable execution of this plan on a large compute cluster. This paper describes the programming model, provides a high-level overview of the design and implementation of the Dryad and DryadLINQ systems, and discusses the tradeoffs and connections to parallel and distributed databases.