Characterization and transformation of unstructured control flow in bulk synchronous GPU applications

Authors:
Haicheng Wu;Gregory Diamos;Jin Wang;Si Li;Sudhakar Yalamanchili
Affiliations:
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA (all authors)
Venue:
International Journal of High Performance Computing Applications
Year:
2012

Abstract

In this paper we identify important classes of control flow in applications targeting commercially available graphics processing units (GPUs) and characterize their presence in real workloads, such as those written in CUDA and OpenCL. Broadly, control flow can be classified as structured or unstructured. We show that most existing techniques for handling divergent control flow in bulk synchronous GPU applications handle structured control flow efficiently, that some cannot execute unstructured control flow directly, and that none handles unstructured control flow efficiently. We provide an approach to reduce the impact of this problem. We implement an unstructured-to-structured control flow transformation for CUDA kernels and assess its performance impact on a large class of GPU applications. The results quantify the importance of improving support for programs with unstructured control flow on GPUs. The transformation can also be used as a JIT compiler pass to execute programs with unstructured control flow on GPU devices that do not support it. This is an important capability for the execution portability of applications that use GPU accelerators.
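
To make the structured/unstructured distinction concrete, here is a minimal sketch in plain C++ (host code, not a CUDA kernel). It is an illustration under stated assumptions, not the transformation implemented in the paper: the function names `unstructured` and `structured` are hypothetical, and the structuring technique shown (duplicating a shared block so that every region has a single entry and a single exit) is one classic option that the abstract does not confirm as the authors' method. The first function spells out with `goto` the control flow that a short-circuit condition such as `a || b` typically produces, where block `S` is entered from two different branches. The second computes the same result using only nested if-then-else.

```cpp
#include <cassert>

// Unstructured form: block S has two incoming edges, one from each
// test, so the region is not a properly nested if-then-else.
int unstructured(bool a, bool b, int x) {
    if (a) goto S;   // first edge into S
    if (b) goto S;   // second edge into S
    x -= 1;          // block T, reached only when both tests fail
    goto Exit;
S:
    x += 10;         // shared block S
Exit:
    return x;
}

// Structured form (assumed technique: duplicate S once per incoming
// edge). Every region now has one entry and one exit, at the cost of
// larger code.
int structured(bool a, bool b, int x) {
    if (a) {
        x += 10;         // copy 1 of S
    } else {
        if (b) {
            x += 10;     // copy 2 of S
        } else {
            x -= 1;      // block T
        }
    }
    return x;
}
```

Both functions return the same value for every combination of `a` and `b`. The duplication in the structured version hints at why such a transformation can affect code size and performance, which is the kind of impact the abstract says the paper assesses.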