Reducing branch divergence in GPU programs

Authors:
Tianyi David Han; Tarek S. Abdelrahman
Affiliations:
University of Toronto, Toronto, Ontario, Canada; University of Toronto, Toronto, Ontario, Canada
Venue:
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Year:
2011

Abstract

Branch divergence has a significant impact on the performance of GPU programs. We propose two novel software-based optimizations, called iteration delaying and branch distribution, that aim to reduce branch divergence. Iteration delaying targets a divergent branch enclosed by a loop within a kernel. It improves performance by executing loop iterations that take the same branch direction and delaying those that take the other direction until later iterations. Branch distribution reduces the length of divergent code by factoring out structurally similar code from the branch paths. We conduct a preliminary evaluation of the two optimizations using both synthetic benchmarks and a highly optimized real-world application. Our evaluation shows that they improve the performance of the synthetic benchmarks by as much as 30% and 80%, respectively, and that of the real-world application by 12% and 16%, respectively.
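
The two optimizations can be illustrated with a small CPU-side model, written here in C++ so that it runs without a GPU. This is a minimal sketch under stated assumptions, not the authors' implementation. The names `PathCost`, `BranchTrace`, `baseline_cost`, `delayed_cost`, `path_before` and `path_after` are hypothetical. The cost model is an assumption: a warp runs in lock-step and pays for every branch path that at least one of its threads takes, and loop overhead and bookkeeping are ignored. The abstract does not say how the direction is chosen in each step, so the majority vote used below is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cost of executing each side of the divergent branch once.
struct PathCost { int then_cost; int else_cost; };

// trace[t][i] is true when thread t takes the "then" path in its i-th iteration.
using BranchTrace = std::vector<std::vector<bool>>;

// Baseline: all threads execute iteration i together. A divergent iteration
// pays for both paths because the warp serializes them.
int baseline_cost(const BranchTrace& trace, PathCost c) {
    if (trace.empty()) return 0;
    int total = 0;
    for (std::size_t i = 0; i < trace[0].size(); ++i) {
        bool any_then = false, any_else = false;
        for (const auto& thread : trace) {
            assert(thread.size() == trace[0].size());  // same trip count assumed
            if (thread[i]) any_then = true; else any_else = true;
        }
        total += (any_then ? c.then_cost : 0) + (any_else ? c.else_cost : 0);
    }
    return total;
}

// Iteration delaying (sketch): each warp step runs only one direction, chosen
// here by majority vote over the threads' next pending iterations. Threads
// whose next iteration takes the other direction wait for a later step.
int delayed_cost(const BranchTrace& trace, PathCost c) {
    std::vector<std::size_t> next(trace.size(), 0);  // per-thread progress
    int total = 0;
    while (true) {
        int want_then = 0, want_else = 0;
        for (std::size_t t = 0; t < trace.size(); ++t) {
            if (next[t] == trace[t].size()) continue;  // thread has finished
            if (trace[t][next[t]]) ++want_then; else ++want_else;
        }
        if (want_then + want_else == 0) break;  // every thread has finished
        bool run_then = want_then >= want_else;
        for (std::size_t t = 0; t < trace.size(); ++t) {
            if (next[t] < trace[t].size() && trace[t][next[t]] == run_then) ++next[t];
        }
        total += run_then ? c.then_cost : c.else_cost;
    }
    return total;
}

// Branch distribution (sketch): both paths compute x*y + z on different
// operands, so the arithmetic is duplicated inside the divergent region.
int path_before(bool cond, int a, int b, int d, int e, int f, int g) {
    if (cond) return a * b + d;
    return e * f + g;
}

// After factoring, only operand selection remains inside the branch; the
// structurally similar multiply-add is executed once, outside it.
int path_after(bool cond, int a, int b, int d, int e, int f, int g) {
    int x, y, z;
    if (cond) { x = a; y = b; z = d; } else { x = e; y = f; z = g; }
    return x * y + z;
}
```

For a four-thread warp whose threads alternate directions out of phase (threads 0 and 2 take then, else, then, else while threads 1 and 3 take the opposite), every baseline iteration is divergent, so `baseline_cost` with a cost of 10 per path returns 80. `delayed_cost` returns 50 for the same trace, because after the first step the threads' pending iterations line up on the same direction. The model is only meant to make the mechanism concrete; it says nothing about the speedups reported in the abstract.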