Software Pipelined Execution of Stream Programs on GPUs

Authors:
Abhishek Udupa;R. Govindarajan;Matthew J. Thazhuthaveetil
Affiliations:
-;-;-
Venue:
Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
Year:
2009

Citing 16
Cited 24

Static scheduling of synchronous data flow programs for digital signal processing

IEEE Transactions on Computers
Code generation schema for modulo scheduled loops

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
Minimizing register requirements under resource-constrained rate-optimal software pipelining

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Looped schedules for dataflow descriptions of multirate signal processing algorithms

Formal Methods in System Design
A stream compiler for communication-exposed architectures

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
StreamIt: A Language for Streaming Applications

CC '02 Proceedings of the 11th International Conference on Compiler Construction
Phased scheduling of stream programs

Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Buffer merging—a powerful technique for reducing memory requirements of synchronous dataflow specifications

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Brook for GPUs: stream computing on graphics hardware

ACM SIGGRAPH 2004 Papers
Optimizing stream programs using linear state space analysis

Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Accelerator: using data parallelism to program GPUs for general-purpose uses

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Orchestrating the execution of stream programs on multicore platforms

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
A lightweight streaming layer for multicore execution

ACM SIGARCH Computer Architecture News

Synergistic execution of stream programs on multicores with accelerators

Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Compiling Python to a hybrid execution environment

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Minimizing communication in rate-optimal software pipelining for stream programs

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Partitioning streaming parallelism for multi-cores: a machine learning based approach

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
An empirical characterization of stream programs and its implications for language and compiler design

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Orchestration by approximation: mapping stream programs onto multicore architectures

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Sponge: portable stream programming on graphics engines

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Scalable framework for mapping streaming applications onto multi-GPU systems

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Profile-guided deployment of stream programs on multicores

Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems
Compiling a high-level language for GPUs: (via language support for architectures and compilers)

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Adaptive input-aware compilation for graphics engines

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
StreamPI: a stream-parallel programming extension for object-oriented programming languages

The Journal of Supercomputing
Automatic CUDA code synthesis framework for multicore CPU and GPU architectures

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Automatic generation of software pipelines for heterogeneous parallel systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Sigma*: symbolic learning of input-output specifications

POPL '13 Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
StreamTMC: Stream compilation for tiled multi-core architectures

Journal of Parallel and Distributed Computing
Optimizing tensor contraction expressions for hybrid CPU-GPU execution

Cluster Computing
Scaling large-data computations on multi-GPU accelerators

Proceedings of the 27th international ACM conference on International conference on supercomputing
Orchestrating stream graphs using model checking

ACM Transactions on Architecture and Code Optimization (TACO)
Using machine learning to partition streaming programs

ACM Transactions on Architecture and Code Optimization (TACO)
A catalog of stream processing optimizations

ACM Computing Surveys (CSUR)
Algorithmic skeleton framework for the orchestration of GPU computations

Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Portable and Transparent Host-Device Communication Optimization for GPGPU Environments

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization

Quantified Score

Hi-index	0.00

Visualization

Abstract

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem --- both scheduling and assignment of filters to processors --- as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipeline parallelism. Further it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single threaded CPU.