SIMD re-convergence at thread frontiers

Authors:
Gregory Diamos;Benjamin Ashbaugh;Subramaniam Maiyuran;Andrew Kerr;Haicheng Wu;Sudhakar Yalamanchili
Affiliations:
Georgia Institute of Technology, Atlanta, GA;Intel Visual Computing Group, Folsom, CA;Intel Visual Computing Group, Folsom, CA;Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA
Venue:
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2011

Citing 11
Cited 11

A bridging model for parallel computation

Communications of the ACM
Chap - a SIMD graphics processor

SIGGRAPH '84 Proceedings of the 11th annual conference on Computer graphics and interactive techniques
Using Hammock Graphs to Structure Programs

IEEE Transactions on Software Engineering
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Optimizing data intensive GPGPU computations for DNA sequence alignment

Parallel Computing
A characterization and analysis of PTX kernels

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
OptiX: a general purpose ray tracing engine

ACM SIGGRAPH 2010 papers
Dynamic warp subdivision for integrated branch and memory divergence tolerance

Proceedings of the 37th annual international symposium on Computer architecture
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Quantifying NUMA and contention effects in multi-GPU systems

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Thread block compaction for efficient SIMT control flow

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture

Characterization and transformation of unstructured control flow in bulk synchronous GPU applications

International Journal of High Performance Computing Applications
Simultaneous branch and warp interweaving for sustained GPU performance

Proceedings of the 39th Annual International Symposium on Computer Architecture
CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures

Proceedings of the 39th Annual International Symposium on Computer Architecture
Reducing divergence in GPGPU programs with loop merging

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Efficient scheduling of recursive control flow on GPUs

Proceedings of the 27th international ACM conference on International conference on supercomputing
Optimizing select conditions on GPUs

Proceedings of the Ninth International Workshop on Data Management on New Hardware
Microarchitectural mechanisms to exploit value structure in SIMT architectures

Proceedings of the 40th Annual International Symposium on Computer Architecture
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation

Proceedings of the 40th Annual International Symposium on Computer Architecture
SIMD divergence optimization through intra-warp compaction

Proceedings of the 40th Annual International Symposium on Computer Architecture
Evaluator-executor transformation for efficient pipelining of loops with conditionals

ACM Transactions on Architecture and Code Optimization (TACO)
HARP: Harnessing inactive threads in many-core processors

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers

Quantified Score

Hi-index	0.00

Visualization

Abstract

Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA, OpenCL, and DirectX Compute. The impact of branch divergence can be quite different depending upon whether the program's control flow is structured or unstructured. In this paper, we show that unstructured control flow occurs frequently in applications and can lead to significant code expansion when executed using existing approaches for handling branch divergence. This paper proposes a new technique for automatically mapping arbitrary control flow onto SIMD processors that relies on a concept of a Thread Frontier, which is a bounded region of the program containing all threads that have branched away from the current warp. This technique is evaluated on a GPU emulator configured to model i) a commodity GPU (Intel Sandybridge), and ii) custom hardware support not realized in current GPU architectures. It is shown that this new technique performs identically to the best existing method for structured control flow, and re-converges at the earliest possible point when executing unstructured control flow. This leads to i) between 1.5 - 633.2% reductions in dynamic instruction counts for several real applications, ii) simplification of the compilation process, and iii) ability to efficiently add high level unstructured programming constructs (e.g., exceptions) to existing data-parallel languages.