Sponge: portable stream programming on graphics engines

Authors:
Amir H. Hormati;Mehrzad Samadi;Mark Woh;Trevor Mudge;Scott Mahlke
Affiliations:
University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA;University of Michigan, Ann Arbor, MI, USA
Venue:
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Year:
2011

Citing 25
Cited 19

Compile-Time Scheduling and Assignment of Data-Flow Program Graphs with Data-Dependent Iteration

IEEE Transactions on Computers
A stream compiler for communication-exposed architectures

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
StreamIt: A Language for Streaming Applications

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Cg: a system for programming graphics hardware in a C-like language

ACM SIGGRAPH 2003 Papers
A Hierarchical Multiprocessor Scheduling Framework for

A Hierarchical Multiprocessor Scheduling Framework for
Brook for GPUs: stream computing on graphics hardware

ACM SIGGRAPH 2004 Papers
Data and Computation Transformations for Brook Streaming Applications on Multiprocessors

Proceedings of the International Symposium on Code Generation and Optimization
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Orchestrating the execution of stream programs on multicore platforms

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
Optimus: efficient realization of streaming applications on FPGAs

CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
CUDA-Lite: Reducing GPU Programming Complexity

Languages and Compilers for Parallel Computing
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs

Languages and Compilers for Parallel Computing
OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Software Pipelined Execution of Stream Programs on GPUs

Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Proceedings of the 36th annual international symposium on Computer architecture
Flextream: Adaptive Compilation of Streaming Applications for Heterogeneous Architectures

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
MacroSS: macro-SIMDization of streaming applications

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
A GPGPU compiler for memory optimization and parallelism management

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Proceedings of the 37th annual international symposium on Computer architecture
An empirical characterization of stream programs and its implications for language and compiler design

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
hiCUDA: High-Level GPGPU Programming

IEEE Transactions on Parallel and Distributed Systems

PTask: operating system abstractions to manage GPUs as compute devices

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Scalable framework for mapping streaming applications onto multi-GPU systems

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Unrolling and retiming of stream applications onto embedded multicore processors

Proceedings of the 49th Annual Design Automation Conference
Compiling a high-level language for GPUs: (via language support for architectures and compilers)

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Adaptive input-aware compilation for graphics engines

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
One stone two birds: synchronization relaxation and redundancy removal in GPU-CPU translation

Proceedings of the 26th ACM international conference on Supercomputing
Performance models for asynchronous data transfers on consumer Graphics Processing Units

Journal of Parallel and Distributed Computing
Softshell: dynamic scheduling on GPUs

ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH Asia 2012
GPUstore: harnessing GPU computing for storage systems in the OS kernel

Proceedings of the 5th Annual International Systems and Storage Conference
StreamTMC: Stream compilation for tiled multi-core architectures

Journal of Parallel and Distributed Computing
Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Input-aware auto-tuning for directive-based GPU programming

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Scaling large-data computations on multi-GPU accelerators

Proceedings of the 27th international ACM conference on International conference on supercomputing
Accelerating Habanero-Java programs with OpenCL generation

Proceedings of the 2013 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools
A large-scale cross-architecture evaluation of thread-coarsening

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Exploring hybrid memory for GPU energy efficiency through software-hardware co-design

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
DANBI: dynamic scheduling of irregular stream programs for many-core systems

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
GPU code generation for ODE-based applications with phased shared-data access patterns

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Graphics processing units (GPUs) provide a low cost platform for accelerating high performance computations. The introduction of new programming languages, such as CUDA and OpenCL, makes GPU programming attractive to a wide variety of programmers. However, programming GPUs is still a cumbersome task for two primary reasons: tedious performance optimizations and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming task that requires a thorough understanding of both the algorithm and the underlying hardware. Unoptimized CUDA programs typically only achieve a small fraction of the peak GPU performance. Second, GPU code lacks efficient portability as code written for one GPU can be inefficient when executed on another. Moving code from one GPU to another while maintaining the desired performance is a non-trivial task often requiring significant modifications to account for the hardware differences. In this work, we propose Sponge, a compilation framework for GPUs using synchronous data flow streaming languages. Sponge is capable of performing a wide variety of optimizations to generate efficient code for graphics engines. Sponge alleviates the problems associated with current GPU programming methods by providing portability across different generations of GPUs and CPUs, and a better abstraction of the hardware details, such as the memory hierarchy and threading model. Using streaming, we provide a write-once software paradigm and rely on the compiler to automatically create optimized CUDA code for a wide variety of GPU targets. Sponge's compiler optimizations improve the performance of the baseline CUDA implementations by an average of 3.2x.