Branch divergence has a significant impact on the performance of GPU programs. We propose two novel software-based optimizations, iteration delaying and branch distribution, that aim to reduce branch divergence. Iteration delaying targets a divergent branch enclosed by a loop within a kernel. It improves performance by executing loop iterations that take the same branch direction and delaying those that take the other direction until later iterations. Branch distribution reduces the length of divergent code by factoring out structurally similar code from the branch paths. We conduct a preliminary evaluation of the two optimizations using both synthetic benchmarks and a highly optimized real-world application. Our evaluation shows that they improve the performance of the synthetic benchmarks by as much as 30% and 80%, respectively, and that of the real-world application by 12% and 16%, respectively.
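To make the idea behind iteration delaying concrete, the following is a minimal host-side sketch in Python rather than GPU code. It models a single warp under simplifying assumptions that are not taken from the abstract: a diverged step costs the sum of both branch paths, a uniform step costs only the path taken, and the warp picks the direction to run by majority vote among the threads' pending iterations. The function names `lockstep_cost` and `delayed_cost` and the cost model are hypothetical and only illustrate the principle.

```python
from typing import List


def lockstep_cost(directions: List[List[bool]], cost_then: int, cost_else: int) -> int:
    """Cost of the unmodified loop: every thread runs iteration i at step i.

    directions[t][i] is True when thread t takes the 'then' path in its
    i-th iteration. All threads are assumed to have the same trip count.
    If the warp diverges at a step, both paths are executed serially.
    """
    total = 0
    for i in range(len(directions[0])):
        taken = {thread[i] for thread in directions}
        if True in taken:
            total += cost_then
        if False in taken:
            total += cost_else
    return total


def delayed_cost(directions: List[List[bool]], cost_then: int, cost_else: int) -> int:
    """Cost with iteration delaying (assumed majority-vote policy).

    At each step only one branch direction is executed. Threads whose
    pending iteration takes that direction run it and advance; the other
    threads keep their pending iteration for a later step.
    """
    next_iter = [0] * len(directions)  # index of each thread's pending iteration
    total = 0
    while True:
        pending = [t for t, d in enumerate(directions) if next_iter[t] < len(d)]
        if not pending:
            break
        wants_then = [t for t in pending if directions[t][next_iter[t]]]
        wants_else = [t for t in pending if not directions[t][next_iter[t]]]
        # Majority vote; ties go to the 'then' path (an arbitrary choice).
        if len(wants_then) >= len(wants_else):
            runners, step_cost = wants_then, cost_then
        else:
            runners, step_cost = wants_else, cost_else
        for t in runners:
            next_iter[t] += 1
        total += step_cost
    return total
```

For a four-thread warp in which two threads alternate then/else and the other two alternate else/then over four iterations, with both paths costing 10 units, the lockstep loop diverges on every iteration and costs 80 units. Under the delaying policy the threads fall into agreement after the first step and the total is 50 units. When no iteration diverges, both functions return the same cost, so in this model delaying only helps on divergent loops.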