Simultaneous branch and warp interweaving for sustained GPU performance

Authors:
Nicolas Brunie;Sylvain Collange;Gregory Diamos
Affiliations:
Kalray and ENS de Lyon;Universidade Federal de Minas Gerais;NVIDIA Research
Venue:
Proceedings of the 39th Annual International Symposium on Computer Architecture
Year:
2012

Citing 22
Cited 7

Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Multithreaded Vector Architectures

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Control Flow Optimization Via Dynamic Reconvergence Prediction

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Conjoined-Core Chip Multiprocessing

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
The Vector-Thread Architecture

IEEE Micro
Vector Lane Threading

ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Transparent control independence (TCI)

Proceedings of the 34th annual international symposium on Computer architecture
NVIDIA Tesla: A Unified Graphics and Computing Architecture

IEEE Micro
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware

ACM Transactions on Architecture and Code Optimization (TACO)
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
The GPU Computing Era

IEEE Micro
Dynamic warp subdivision for integrated branch and memory divergence tolerance

Proceedings of the 37th annual international symposium on Computer architecture
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Proceedings of the 37th annual international symposium on Computer architecture
Dynamic detection of uniform and affine vectors in GPGPU computations

Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Barra: A Parallel Functional Simulator for GPGPU

MASCOTS '10 Proceedings of the 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Energy-efficient mechanisms for managing thread context in throughput processors

Proceedings of the 38th annual international symposium on Computer architecture
Thread block compaction for efficient SIMT control flow

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
PEPSC: A Power-Efficient Processor for Scientific Computing

PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Improving GPU performance via large warps and two-level warp scheduling

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SIMD re-convergence at thread frontiers

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Towards Solving the Table Maker's Dilemma on GPU

PDP '12 Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing

Warp size impact in GPUs: large or small?

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Efficient scheduling of recursive control flow on GPUs

Proceedings of the 27th international ACM conference on International conference on supercomputing
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation

Proceedings of the 40th Annual International Symposium on Computer Architecture
SIMD divergence optimization through intra-warp compaction

Proceedings of the 40th Annual International Symposium on Computer Architecture
Evaluator-executor transformation for efficient pipelining of loops with conditionals

ACM Transactions on Architecture and Code Optimization (TACO)
HARP: Harnessing inactive threads in many-core processors

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Leveraging GPUs using cooperative loop speculation

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Single-Instruction Multiple-Thread (SIMT) micro-architectures implemented in Graphics Processing Units (GPUs) run fine-grained threads in lockstep by grouping them into units, referred to as warps, to amortize the cost of instruction fetch, decode and control logic over multiple execution units. As individual threads take divergent execution paths, their processing takes place sequentially, defeating part of the efficiency advantage of SIMD execution. We present two complementary techniques that mitigate the impact of thread divergence on SIMT micro-architectures. Both techniques relax the SIMD execution model by allowing two distinct instructions to be scheduled to disjoint subsets of the the same row of execution units, instead of one single instruction. They increase flexibility by providing more thread grouping opportunities than SIMD, while preserving the affinity between threads to avoid introducing extra memory divergence. We consider (1) co-issuing instructions from different divergent paths of the same warp and (2) co-issuing instructions from different warps. To support (1), we introduce a novel thread reconvergence technique that ensures threads are run back in lockstep at control-flow reconvergence points without hindering their ability to run branches in parallel. We propose a lane shuffling technique to allow solution (2) to benefit from inter-warp correlations in divergence patterns. The combination of all these techniques improves performance by 23% on a set of regular GPGPU applications and by 40% on irregular applications, while maintaining the same instruction-fetch and processing-unit resource requirements as the contemporary Fermi GPU architecture.