Branch divergence can incur a high performance penalty in GPGPU programs. We propose a software optimization, called loop merging, that aims to reduce the divergence caused by a loop whose trip count varies across the threads of a warp. The optimization merges the divergent loop with one or more of its enclosing outer loops into a single loop. As a result, the threads of a warp no longer have to wait for each other in every outer-loop iteration, which improves execution efficiency. We implement loop merging in LLVM. Our evaluation on a Fermi GPU shows that it improves the performance of a synthetic benchmark and five application benchmarks by up to 1.6X and 4.3X, respectively.
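The idea of merging a divergent inner loop with its enclosing loop can be illustrated with a small host-side C++ sketch. It is not the LLVM pass described above, only one plausible hand-written form of the transformation. The names `work`, `nested_sum`, `merged_sum`, `warp_steps_nested`, and `warp_steps_merged` are hypothetical. The step counts come from a deliberately simplified lockstep model: one step per loop-body execution, ignoring outer-loop overhead in the nested form and charging one extra step per outer iteration in the merged form.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-iteration work; stands in for the real loop body.
static long work(int i, int j) { return static_cast<long>(i + 1) * (j + 1); }

// Original form for one thread: the inner trip count trips[i] may differ
// from thread to thread, so threads of a warp diverge in the inner loop.
long nested_sum(const std::vector<int>& trips) {
    long acc = 0;
    for (int i = 0; i < static_cast<int>(trips.size()); ++i)
        for (int j = 0; j < trips[i]; ++j)
            acc += work(i, j);
    return acc;
}

// Merged form for one thread: a single loop carries both indices, so a
// thread that finishes its inner work moves to the next outer iteration
// without waiting for the other threads.
long merged_sum(const std::vector<int>& trips) {
    long acc = 0;
    const int n = static_cast<int>(trips.size());
    int i = 0, j = 0;
    while (i < n) {
        if (j < trips[i]) {
            acc += work(i, j);   // inner-loop body
            ++j;
        } else {
            ++i;                 // advance the outer loop
            j = 0;
        }
    }
    return acc;
}

// Simplified model: warp[t][i] is the inner trip count of thread t in outer
// iteration i. In the nested form, every outer iteration costs as many
// lockstep steps as the slowest thread needs.
long warp_steps_nested(const std::vector<std::vector<int>>& warp) {
    if (warp.empty()) return 0;
    const std::size_t n = warp[0].size();
    long steps = 0;
    for (std::size_t i = 0; i < n; ++i) {
        int longest = 0;
        for (const auto& thread : warp) {
            assert(thread.size() == n);
            longest = std::max(longest, thread[i]);
        }
        steps += longest;
    }
    return steps;
}

// In the merged form, the warp runs until the thread with the most total
// work finishes; each thread also spends one step per outer iteration on
// the index update.
long warp_steps_merged(const std::vector<std::vector<int>>& warp) {
    long steps = 0;
    for (const auto& thread : warp) {
        long own = static_cast<long>(thread.size());
        for (int count : thread) {
            assert(count >= 0);
            own += count;
        }
        steps = std::max(steps, own);
    }
    return steps;
}
```

For a two-thread warp with trip counts `{8, 1}` and `{1, 8}`, both forms compute the same per-thread result, while the model charges fewer lockstep steps to the merged form because the long inner loops of the two threads overlap with the short ones. When trip counts are uniform across threads, the extra index-update steps make the merged form slightly more expensive under this model, so the benefit depends on how unbalanced the trip counts are.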