CUDA-Lite: Reducing GPU Programming Complexity

Authors:
Sain-Zee Ueng;Melvin Lathara;Sara S. Baghsorkhi;Wen-Mei W. Hwu
Affiliations:
Center for Reliable and High-Performance Computing Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign,;Center for Reliable and High-Performance Computing Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign,;Center for Reliable and High-Performance Computing Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign,;Center for Reliable and High-Performance Computing Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign,
Venue:
Languages and Compilers for Parallel Computing
Year:
2008

Citing 12
Cited 41

A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Compiler-directed scratch pad memory hierarchy design and management

Proceedings of the 39th annual Design Automation Conference
Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications

EDTC '97 Proceedings of the 1997 European conference on Design and Test
An integrated simdization framework using virtual vectors

Proceedings of the 19th annual international conference on Supercomputing
Optimizing data permutations for SIMD devices

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Compilation for explicitly managed memory hierarchies

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Programming with tiles

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Accelerating advanced mri reconstructions on gpus

Proceedings of the 5th conference on Computing frontiers
The spec# programming system: an overview

CASSIS'04 Proceedings of the 2004 international conference on Construction and Analysis of Safe, Secure, and Interoperable Smart Devices

Extending abstract GPU APIs to shared memory

Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion
memCUDA: map device memory to host memory on GPGPU platform

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
OpenMPC: Extended OpenMP Programming and Tuning for GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Towards metaprogramming for parallel systems on a chip

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Compiler-directed memory management for heterogeneous MPSoCs

Journal of Systems Architecture: the EUROMICRO Journal
On-the-fly elimination of dynamic irregularities for GPU computing

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Sponge: portable stream programming on graphics engines

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Reducing branch divergence in GPU programs

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
A programming language interface to describe transformations and code generation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Unified parallel C for GPU clusters: language extensions and compiler implementation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Optimizing the exploitation of multicore processors and GPUs with OpenMP and OpenCL

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Automatic CPU-GPU communication management and optimization

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
A platform-independent tool for modeling parallel programs

Proceedings of the 49th Annual Southeast Regional Conference
CuMAPz: a tool to analyze memory access patterns in CUDA

Proceedings of the 48th Design Automation Conference
PTask: operating system abstractions to manage GPUs as compute devices

SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
CUDACL+: a framework for GPU programs

Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion
GROPHECY: GPU performance projection from CPU code skeletons

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
A unified optimizing compiler framework for different GPGPU architectures

ACM Transactions on Architecture and Code Optimization (TACO)
Adaptive input-aware compilation for graphics engines

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
A virtual memory based runtime to support multi-tenancy in clusters with GPUs

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
A compiler-assisted runtime-prefetching scheme for heterogeneous platforms

IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Financial software on GPUs: between Haskell and Fortran

Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing
GPUstore: harnessing GPU computing for storage systems in the OS kernel

Proceedings of the 5th Annual International Systems and Storage Conference
Dataflow-driven GPU performance projection for multi-kernel transformations

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A script-based autotuning compiler system to generate high-performance CUDA code

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Polyhedral parallel code generation for CUDA

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
OpenMPC: extended OpenMP for efficient programming and tuning on GPUs

International Journal of Computational Science and Engineering
From latex specifications to parallel codes

The Journal of Supercomputing
From physics model to results: an optimizing framework for cross-architecture code generation

Proceedings of the Extreme Scaling Workshop
Scaling large-data computations on multi-GPU accelerators

Proceedings of the 27th international ACM conference on International conference on supercomputing
Skeletal based programming for dynamic programming on MultiGPU systems

The Journal of Supercomputing
Memory performance estimation of CUDA programs

ACM Transactions on Embedded Computing Systems (TECS) - Special issue on application-specific processors
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Dandelion: a compiler and runtime for heterogeneous systems

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
User transparent data and task parallel multimedia computing with Pyxis-DT

Future Generation Computer Systems
Adaptive Mapping and Parameter Selection Scheme to Improve Automatic Code Generation for GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
CUDA-NP: realizing nested thread-level parallelism in GPGPU applications

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
The Implementation of a High Performance GPGPU Compiler

International Journal of Parallel Programming
APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation

Proceedings of Workshop on General Purpose Processing Using GPUs
From physics model to results: An optimizing framework for cross-architecture code generation

Scientific Programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage of the various memories is essential, making a difference of 2-17x in performance. Currently, the task of determining the appropriate memory to use and the coding of data transfer between memories is still left to the programmer. We believe that this task can be better performed by automated tools. We present CUDA-lite, an enhancement to CUDA, as one such tool. We leverage programmer knowledge via annotations to perform transformations and show preliminary results that indicate auto-generated code can have performance comparable to hand coding.