A code motion technique for accelerating general-purpose computation on the GPU

Authors:
Takatoshi Ikeda; Fumihiko Ino; Kenichi Hagihara
Affiliations:
Graduate School of Information Science and Technology, Osaka University, Toyonaka, Osaka, Japan; Graduate School of Information Science and Technology, Osaka University, Toyonaka, Osaka, Japan; Graduate School of Information Science and Technology, Osaka University, Toyonaka, Osaka, Japan
Venue:
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Year:
2006


Abstract

In recent years, graphics processing units (GPUs) have provided increasingly high performance through programmable internal processors, namely vertex processors (VPs) and fragment processors (FPs). These capabilities motivate general-purpose computation on GPUs (GPGPU) beyond graphics applications. Although VPs and FPs are connected in a pipeline, many GPGPU implementations use only the FPs as the computational engine of the GPU. Such implementations may suffer lower performance because the FPs, which are heavily loaded relative to the VPs, become a bottleneck in pipeline execution. The objective of our work is to improve the performance of GPGPU programs by eliminating this bottleneck. To achieve this, we present a code motion technique that reduces the FP workload by moving suitable assembly instructions from the FP program to the VP program. We also define which instructions are movable, namely those whose relocation does not change the I/O specification between the CPU and the GPU. The experimental results show that (1) our technique reduces the execution time of a Gaussian filter program by approximately 40% and (2) it successfully reduces the FP workload in 10 out of 18 GPGPU programs.
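
The abstract does not spell out which instructions count as movable, so the following is only a minimal sketch of the general idea, not the authors' definition or implementation. It assumes a toy one-dimensional "scanline" between two vertices and a rasterizer that interpolates vertex outputs linearly (perspective correction is ignored). Under that assumption, an addition of a constant offset to a texture coordinate, as a filter kernel might need for a neighboring sample, gives the same per-fragment values whether it runs once per fragment or once per vertex. All function names here are hypothetical.

```python
import math


def interpolate(v0, v1, n):
    """Toy rasterizer: linearly interpolate a vertex attribute
    across n fragments, sampled at fragment centers."""
    return [v0 + (v1 - v0) * (i + 0.5) / n for i in range(n)]


def fp_only(n, offset):
    """All work in the fragment stage: one add per fragment."""
    coords = interpolate(0.0, 1.0, n)
    shifted = [c + offset for c in coords]
    return shifted, n  # (values, number of fragment-stage adds)


def with_code_motion(n, offset):
    """The add is moved to the vertex stage: one add per vertex,
    and the rasterizer interpolates the already shifted values."""
    v0, v1 = 0.0 + offset, 1.0 + offset
    return interpolate(v0, v1, n), 0  # no fragment-stage adds remain


def square_in_vertex_stage(n):
    """Counterexample: squaring is not linear, so moving it to the
    vertex stage changes the per-fragment results."""
    return interpolate(0.0 ** 2, 1.0 ** 2, n)


if __name__ == "__main__":
    before, fp_ops_before = fp_only(8, 0.25)
    after, fp_ops_after = with_code_motion(8, 0.25)
    same = all(math.isclose(a, b) for a, b in zip(before, after))
    print("same fragment values:", same)
    print("fragment-stage adds:", fp_ops_before, "->", fp_ops_after)
```

In this toy model the per-fragment work drops from one add per fragment to none, at the cost of two adds in the vertex stage. The squaring counterexample illustrates why some restriction on movable instructions is needed: a non-linear operation does not commute with linear interpolation.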