Divergence Analysis and Optimizations

Authors:
Bruno Coutinho;Diogo Sampaio;Fernando Magno Quintao Pereira;Wagner Meira Jr.
Affiliations:
-;-;-;-
Venue:
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Year:
2011

Citing 0
Cited 8

GKLEE: concolic verification and test generation for GPUs

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Dynamic compilation of data-parallel kernels for vector processors

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Spill code placement for SIMD machines

SBLP'12 Proceedings of the 16th Brazilian conference on Programming Languages
Exploiting uniform vector instructions for GPGPU performance, energy efficiency, and opportunistic reliability enhancement

Proceedings of the 27th international ACM conference on International conference on supercomputing
Microarchitectural mechanisms to exploit value structure in SIMT architectures

Proceedings of the 40th Annual International Symposium on Computer Architecture
A large-scale cross-architecture evaluation of thread-coarsening

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Divergence analysis

ACM Transactions on Programming Languages and Systems (TOPLAS)
Leveraging GPUs using cooperative loop speculation

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

The growing interest in GPU programming has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers a tremendous computational power, however, the model also brings restrictions. In particular, processing elements (PEs) execute in lock-step, and may lose performance due to divergences caused by conditional branches. In face of divergences, some PEs execute, while others wait, this alternation ending when they reach a synchronization point. In this paper we introduce divergence analysis, a static analysis that determines which program variables will have the same values for every PE. This analysis is useful in three different ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers to manually improve their SIMD applications, and it also guides the compiler in the optimization of SIMD programs. We demonstrate this last point by introducing branch fusion, a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Our implementation has been accepted in the Ocelot open-source CUDA compiler, and is publicly available. We have tested it on many industrial-strength GPU benchmarks, including Rodinia and the Nvidia's SDK. Our divergence analysis has a 34% false-positive rate, compared to the results of a dynamic profiler. Our automatic optimization adds a 3% speed-up onto parallel quick sort, a heavily optimized benchmark. Our manual optimizations extend this number to over 10%.