APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation

Authors:
Yulong Yu;Xubin He;He Guo;Sihui Zhong;Yuxin Wang;Xin Chen;Weijun Xiao
Affiliations:
School of Software Technology, Dalian University of Technology, Dalian, China;Department of Electrical and Computer Engineering, Virginia Commonwealth University, Richmond, VA, USA;School of Software Technology, Dalian University of Technology, Dalian, China;School of Software Technology, Dalian University of Technology, Dalian, China;School of Computer Science and Technology, Dalian University of Technology, Dalian, China;School of Software Technology, Dalian University of Technology, Dalian, China;Department of Electrical and Computer Engineering, Virginia Commonwealth University, Richmond, VA, USA
Venue:
Proceedings of Workshop on General Purpose Processing Using GPUs
Year:
2014

Abstract

General-purpose graphics processing units (GPGPUs) offer an opportunity to improve the performance of many applications. However, exploiting parallelism is unproductive in current programming frameworks such as CUDA and OpenCL: programmers must handle many GPGPU architecture details, so it is challenging to balance programmability against the efficiency of performance tuning. Parallel Repacking (PR) is a popular performance tuning approach for GPGPU applications that improves performance by changing the parallel granularity. Existing code transformation algorithms using PR increase productivity, but they do not cover an adequate range of code patterns and do not provide effective code error detection. In this paper, we propose a novel parallel repacking algorithm (APR) to cover a wide range of code patterns and improve efficiency. We develop an efficient code model that expresses a GPGPU program as a recursive statement sequence and introduces the concept of a singular statement. Building upon this model, APR applies appropriate transformation rules to singular and non-singular statements to generate the repacked code. A recursive transformation is performed when APR encounters a branching or loop singular statement. Additionally, singular statements unify the transformation for barriers and data sharing, and enable APR to detect barrier errors. Experimental results based on a prototype show that our proposed APR covers more code patterns than existing solutions such as the automatic thread coarsening in Crest, and that code repacked by APR achieves a speedup of up to 3.28X, in some cases even higher than manually tuned repacked code.
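
To make "changing the parallel granularity" concrete, the following is a minimal sketch of the general idea behind repacking, emulated sequentially in Python rather than written as a real GPU kernel. It is not the APR algorithm: the names `launch`, `saxpy_fine`, `saxpy_repacked`, and `factor` are hypothetical, and the example assumes a kernel with no barriers or data sharing, which are exactly the cases the abstract says APR handles through singular statements.

```python
def launch(kernel, num_threads, *args):
    """Emulate a GPU launch by running the kernel once per thread id."""
    for tid in range(num_threads):
        kernel(tid, *args)


def saxpy_fine(tid, a, x, y, out):
    # Original granularity: one thread computes one element.
    out[tid] = a * x[tid] + y[tid]


def saxpy_repacked(tid, a, x, y, out, factor):
    # Repacked granularity: one thread covers up to `factor` elements.
    # The strided index keeps neighbouring threads on neighbouring
    # elements in each step of the loop.
    num_threads = (len(out) + factor - 1) // factor
    for k in range(factor):
        i = tid + k * num_threads
        if i < len(out):  # guard for sizes not divisible by factor
            out[i] = a * x[i] + y[i]


n, factor = 10, 4  # illustrative sizes, chosen for this sketch only
x = [float(i) for i in range(n)]
y = [1.0] * n

fine = [0.0] * n
launch(saxpy_fine, n, 2.0, x, y, fine)

coarse = [0.0] * n
repacked_threads = (n + factor - 1) // factor  # fewer, heavier threads
launch(saxpy_repacked, repacked_threads, 2.0, x, y, coarse, factor)
```

Both launches produce the same output; only the number of threads and the work per thread differ. A kernel containing a barrier could not be repacked by simply wrapping its body in a loop like this, which is the kind of pattern a transformation algorithm has to treat separately.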