Accelerator: using data parallelism to program GPUs for general-purpose uses

Authors:
David Tarditi;Sidd Puri;Jose Oglesby
Affiliations:
Microsoft Research;Microsoft Research;Microsoft Research
Venue:
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Year:
2006

Citing 13
Cited 80



Abstract

GPUs are difficult to program for general-purpose uses. Programmers can either learn graphics APIs and convert their applications to use graphics pipeline operations, or they can use stream programming abstractions of GPUs. We describe Accelerator, a system that instead uses data parallelism to program GPUs for general-purpose uses. Programmers use a conventional imperative programming language and a library that provides only high-level data-parallel operations. No aspects of GPUs are exposed to programmers. The library implementation compiles the data-parallel operations on the fly to optimized GPU pixel shader code and API calls. We describe the compilation techniques used to do this. We evaluate the effectiveness of using data parallelism to program GPUs by providing results for a set of compute-intensive benchmarks. We compare the performance of Accelerator versions of the benchmarks against hand-written pixel shaders. The speeds of the Accelerator versions are typically within 50% of the speeds of hand-written pixel shader code. Some benchmarks significantly outperform C versions on a CPU: they are up to 18 times faster than C code running on a CPU.
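The programming model in the abstract (a library of high-level data-parallel operations that is compiled on the fly) can be illustrated with a small sketch. This is not Accelerator's API. The names `Node`, `parallel_array`, `compile_kernel`, and `to_list` are hypothetical, the deferred expression graph is an assumed reading of "on the fly", and a plain Python closure stands in for the generated pixel shader.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Node:
    """One node of a lazily built expression graph (hypothetical design)."""
    op: str                        # "input", "add", or "mul"
    args: Tuple["Node", ...] = ()  # operand nodes for "add" / "mul"
    data: Tuple[float, ...] = ()   # element values, used only by "input"

    def __add__(self, other: "Node") -> "Node":
        # Record the operation instead of computing it.
        return Node("add", (self, other))

    def __mul__(self, other: "Node") -> "Node":
        return Node("mul", (self, other))


def parallel_array(values) -> Node:
    """Wrap ordinary values as a data-parallel array (hypothetical helper)."""
    return Node("input", data=tuple(float(v) for v in values))


_OPS = {"add": operator.add, "mul": operator.mul}


def compile_kernel(node: Node) -> Callable[[int], float]:
    """Fuse the whole graph into one per-element function.

    The returned closure plays the role of a generated pixel shader:
    it computes one output element from the input elements at index i.
    """
    if node.op == "input":
        data = node.data
        return lambda i: data[i]
    left, right = (compile_kernel(arg) for arg in node.args)
    fn = _OPS[node.op]
    return lambda i: fn(left(i), right(i))


def length(node: Node) -> int:
    """Number of elements produced by the expression rooted at node."""
    return len(node.data) if node.op == "input" else length(node.args[0])


def to_list(node: Node) -> List[float]:
    """Compile on demand, then run the kernel once per element."""
    kernel = compile_kernel(node)
    return [kernel(i) for i in range(length(node))]


a = parallel_array([1, 2, 3])
b = parallel_array([10, 20, 30])
expr = (a + b) * a       # builds a graph; nothing is computed yet
result = to_list(expr)   # compiles and evaluates the fused expression
```

In this sketch the user code contains only whole-array operations, and no device detail appears until `to_list` forces evaluation. A real system would emit shader code and graphics API calls at that point rather than a Python closure.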