Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping

Authors:
Eddy Z. Zhang;Yunlian Jiang;Ziyu Guo;Xipeng Shen
Affiliations:
The College of William and Mary, Williamsburg, VA;The College of William and Mary, Williamsburg, VA;The College of William and Mary, Williamsburg, VA;The College of William and Mary, Williamsburg, VA
Venue:
Proceedings of the 24th ACM International Conference on Supercomputing
Year:
2010

Citing 16
Cited 11

Improving cache performance in dynamic applications through data and computation reorganization at run time

Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Improving effective bandwidth through compiler enhancement of global cache reuse

Journal of Parallel and Distributed Computing
Cache-conscious coallocation of hot data streams

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Exploiting Locality for Irregular Scientific Codes

IEEE Transactions on Parallel and Distributed Systems
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Lattice Boltzmann based PDE solver on the GPU

The Visual Computer: International Journal of Computer Graphics
Fast scan algorithms on graphics processors

Proceedings of the 22nd annual international conference on Supercomputing
A compiler framework for optimization of affine loop nests for gpgpus

Proceedings of the 22nd annual international conference on Supercomputing
Efficient computation of sum-products on GPUs through software-managed cache

Proceedings of the 22nd annual international conference on Supercomputing
Bandwidth intensive 3-D FFT kernel for GPUs using CUDA

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Benchmarking GPUs to tune dense linear algebra

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A control-structure splitting optimization for GPGPU

Proceedings of the 6th ACM conference on Computing frontiers
A cross-input adaptive framework for GPU program optimizations

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing

GRace: a low-overhead mechanism for detecting data races in GPU programs

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
On-the-fly elimination of dynamic irregularities for GPU computing

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Reducing branch divergence in GPU programs

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Optimization of N-queens solvers on graphics processors

APPT'11 Proceedings of the 9th international conference on Advanced parallel processing technologies
Characterizing and improving the use of demand-fetched caches in GPUs

Proceedings of the 26th ACM international conference on Supercomputing
One stone two birds: synchronization relaxation and redundancy removal in GPU-CPU translation

Proceedings of the 26th ACM international conference on Supercomputing
Reducing thread divergence in GPU-based b&b applied to the flow-shop problem

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Optimizing select conditions on GPUs

Proceedings of the Ninth International Workshop on Data Management on New Hardware
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation

Proceedings of the 40th Annual International Symposium on Computer Architecture
Divergence analysis

ACM Transactions on Programming Languages and Systems (TOPLAS)
An Infrastructure for Tackling Input-Sensitivity of GPU Program Optimizations

International Journal of Parallel Programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

Because of their tremendous computing power and remarkable cost efficiency, GPUs (graphic processing unit) have quickly emerged as a kind of influential platform for high performance computing. However, as GPUs are designed for massive data-parallel computing, their performance is subject to the presence of condition statements in a GPU application. On a conditional branch where threads diverge in which path to take, the threads taking different paths have to run serially. Such divergences often cause serious performance degradations, impairing the adoption of GPU for many applications that contain non-trivial branches or certain types of loops. This paper presents a systematic investigation in the employment of runtime thread-data remapping for solving that problem. It introduces an abstract form of GPU applications, based on which, it describes the use of reference redirection and data layout transformation for remapping data and threads to minimize thread divergences. It discusses the major challenges for practical deployment of the remapping techniques, most notably, the conflict between the large remapping overhead and the need for the remapping to happen on the fly because of the dependence of thread divergences on runtime values. It offers a solution to the challenge by proposing a CPU-GPU pipelining scheme and a label-assign-move (LAM) algorithm to virtually hide all the remapping overhead. At the end, it reports significant performance improvement produced by the remapping for a set of GPU applications, demonstrating the potential of the techniques for streamlining GPU applications on the fly.