Dandelion: a compiler and runtime for heterogeneous systems

Authors:
Christopher J. Rossbach;Yuan Yu;Jon Currey;Jean-Philippe Martin;Dennis Fetterly
Affiliations:
Microsoft Research Silicon Valley;Microsoft Research Silicon Valley;Microsoft Research Silicon Valley;Microsoft Research Silicon Valley;Microsoft Research Silicon Valley
Venue:
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Year:
2013

Citing 73
Cited 0

The CODE 2.0 graphical parallel programming language

ICS '92 Proceedings of the 6th international conference on Supercomputing
A case for intelligent disks (IDISKs)

ACM SIGMOD Record
The click modular router

ACM Transactions on Computer Systems (TOCS)
Programming Microsoft Directshow

Programming Microsoft Directshow
P-RIO: A Modular Parallel-Programming Environment

IEEE Concurrency
Active Disks for Large-Scale Data Processing

Computer
StreamIt: A Language for Streaming Applications

CC '02 Proceedings of the 11th International Conference on Compiler Construction
TCP Offload to the Rescue

Queue - Open Source
Brook for GPUs: stream computing on graphics hardware

ACM SIGGRAPH 2004 Papers
Processor-embedded distributed smart disks for I/O-intensive workloads: architectures, performance models and evaluation

Journal of Parallel and Distributed Computing
CellSs: a programming model for the cell BE architecture

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Programming using RapidMind on the Cell BE

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Fast computation of database operations using graphics processors

SIGGRAPH '05 ACM SIGGRAPH 2005 Courses
FFPF: fairly fast packet filters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Data Flow Program Graphs

Computer
Ruler: high-speed packet matching and rewriting on NPUs

Proceedings of the 3rd ACM/IEEE Symposium on Architecture for networking and communications systems
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Tapping into the fountain of CPUs: on operating system support for programmable devices

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Merge: a programming model for heterogeneous multi-core systems

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Accelerating computing with the cell broadband engine processor

Proceedings of the 5th conference on Computing frontiers
Relational joins on graphics processors

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
PipesFS: fast Linux I/O in the unix tradition

ACM SIGOPS Operating Systems Review - Research and developments in the Linux kernel
Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary

ECOOP '08 Proceedings of the 22nd European conference on Object-Oriented Programming
Mars: a MapReduce framework on graphics processors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
CUDA-Lite: Reducing GPU Programming Complexity

Languages and Compilers for Parallel Computing
Predictive Runtime Code Scheduling for Heterogeneous Architectures

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
hiCUDA: a high-level directive-based language for GPU programming

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Exploiting the Cell/BE Architecture with the StarPU Unified Runtime System

SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Relational query coprocessing on graphics processors

ACM Transactions on Database Systems (TODS)
Distributed aggregation for data-parallel computing: interfaces and implementations

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
FlumeJava: easy, efficient data-parallel pipelines

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Pregel: a system for large-scale graph processing

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Twister: a runtime for iterative MapReduce

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Lime: a Java-compatible and synthesizable language for heterogeneous architectures

Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Maestro: data orchestration and tuning for OpenCL devices

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
HaLoop: efficient iterative data processing on large clusters

Proceedings of the VLDB Endowment
Data-Aware Task Scheduling on Multi-accelerator Based Platforms

ICPADS '10 Proceedings of the 2010 IEEE 16th International Conference on Parallel and Distributed Systems
Copperhead: compiling an embedded data parallel language

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Application-Tailored I/O with Streamline

ACM Transactions on Computer Systems (TOCS)
CIEL: a universal execution engine for distributed data-flow computing

Proceedings of the 8th USENIX conference on Networked systems design and implementation
A static task partitioning approach for heterogeneous systems using OpenCL

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Processing data streams with hard real-time constraints on heterogeneous systems

Proceedings of the international conference on Supercomputing
TimeGraph: GPU scheduling for real-time multi-tasking environments

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Pegasus: coordinated scheduling for virtualized accelerator-based systems

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
Productive cluster programming with OmpSs

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
PTask: operating system abstractions to manage GPUs as compute devices

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Liszt: a domain specific language for building portable mesh-based PDE solvers

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
FPL-3: towards language support for distributed packet processing

NETWORKING'05 Proceedings of the 4th IFIP-TC6 international conference on Networking Technologies, Services, and Protocols; Performance of Computer and Communication Networks; Mobile and Wireless Communication Systems
FPL-3E: towards language support for reconfigurable packet processing

SAMOS'05 Proceedings of the 5th international conference on Embedded Computer Systems: architectures, Modeling, and Simulation
Enabling task-level scheduling on heterogeneous platforms

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Fast PageRank Computation on a GPU Cluster

PDP '12 Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing
Scientific and Engineering Computing Using ATI Stream Technology

Computing in Science and Engineering
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Spinning fast iterative data flows

Proceedings of the VLDB Endowment
REX: recursive, delta-based data-centric computation

Proceedings of the VLDB Endowment
Parallel PageRank computation using GPUs

Proceedings of the Third Symposium on Information and Communication Technology
MATE-CG: A Map Reduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters

IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium
Rootbeer: Seamlessly Using GPUs from Java

HPCC '12 Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems
Legion: expressing locality and independence with logical regions

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Accelerating text mining workloads in a MapReduce-based distributed GPU environment

Journal of Parallel and Distributed Computing
GPUfs: integrating a file system with GPUs

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
LINQits: big data on little clients

Proceedings of the 40th Annual International Symposium on Computer Architecture
Naiad: a timely dataflow system

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

Quantified Score

Hi-index	0.00

Visualization

Abstract

Computer systems increasingly rely on heterogeneity to achieve greater performance, scalability and energy efficiency. Because heterogeneous systems typically comprise multiple execution contexts with different programming abstractions and runtimes, programming them remains extremely challenging. Dandelion is a system designed to address this programmability challenge for data-parallel applications. Dandelion provides a unified programming model for heterogeneous systems that span diverse execution contexts including CPUs, GPUs, FPGAs, and the cloud. It adopts the .NET LINQ (Language INtegrated Query) approach, integrating data-parallel operators into general purpose programming languages such as C# and F#. It therefore provides an expressive data model and native language integration for user-defined functions, enabling programmers to write applications using standard high-level languages and development tools. Dandelion automatically and transparently distributes data-parallel portions of a program to available computing resources, including compute clusters for distributed execution and CPU and GPU cores of individual nodes for parallel execution. To enable automatic execution of .NET code on GPUs, Dandelion cross-compiles .NET code to CUDA kernels and uses the PTask runtime [85] to manage GPU execution. This paper discusses the design and implementation of Dandelion, focusing on the distributed CPU and GPU implementation. We evaluate the system using a diverse set of workloads.