Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

Authors:
Wilson W. L. Fung;Ivan Sham;George Yuan;Tor M. Aamodt
Affiliations:
-;-;-;-
Venue:
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2007

Citing 0
Cited 62

Tradeoffs in designing accelerator architectures for visual computing

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
How GPUs can outperform ASICs for fast LDPC decoding

Proceedings of the 23rd international conference on Supercomputing
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware

ACM Transactions on Architecture and Code Optimization (TACO)
Stream compaction for deferred shading

Proceedings of the Conference on High Performance Graphics 2009
Increasing memory miss tolerance for SIMD cores

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Using the graphics processor unit to realize data streaming operations

Proceedings of the 6th Middleware Doctoral Symposium
COMPASS: a programmable data prefetcher using idle GPU shaders

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping

Proceedings of the 24th ACM International Conference on Supercomputing
Dynamic warp subdivision for integrated branch and memory divergence tolerance

Proceedings of the 37th annual international symposium on Computer architecture
An instruction-systolic programmable shader architecture for multi-threaded 3D graphics processing

Journal of Parallel and Distributed Computing
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Throughput-Effective On-Chip Networks for Manycore Accelerators

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
On-the-fly elimination of dynamic irregularities for GPU computing

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Sponge: portable stream programming on graphics engines

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Massively Parallel Logic Simulation with GPUs

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Elastic pipeline: addressing GPU on-chip shared memory bank conflicts

Proceedings of the 8th ACM International Conference on Computing Frontiers
Automatic OpenCL device characterization: guiding optimized kernel design

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Optimization of N-queens solvers on graphics processors

APPT'11 Proceedings of the 9th international conference on Advanced parallel processing technologies
Massively parallel identification of intersection points for GPGPU ray tracing

ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part II
Exploring the limits of GPGPU scheduling in control flow bound applications

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Hardware transactional memory for GPU architectures

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Improving GPU performance via large warps and two-level warp scheduling

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
SIMD re-convergence at thread frontiers

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Dynamic task-scheduling and resource management for GPU accelerators in medical imaging

ARCS'12 Proceedings of the 25th international conference on Architecture of Computing Systems
On software design for stochastic processors

Proceedings of the 49th Annual Design Automation Conference
Characterization and transformation of unstructured control flow in bulk synchronous GPU applications

International Journal of High Performance Computing Applications
On the correctness of the SIMT execution model of GPUs

ESOP'12 Proceedings of the 21st European conference on Programming Languages and Systems
One stone two birds: synchronization relaxation and redundancy removal in GPU-CPU translation

Proceedings of the 26th ACM international conference on Supercomputing
CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures

Proceedings of the 39th Annual International Symposium on Computer Architecture
Boosting mobile GPU performance with a decoupled access/execute fragment processor

Proceedings of the 39th Annual International Symposium on Computer Architecture
Reducing thread divergence in GPU-based b&b applied to the flow-shop problem

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Softshell: dynamic scheduling on GPUs

ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH Asia 2012
RISE: improving the streaming processors reliability against soft errors in gpgpus

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Multi2Sim: a simulation framework for CPU-GPU computing

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Branch and data herding: reducing control and memory divergence for error-tolerant GPU applications

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Efficient design space exploration of GPGPU architectures

Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Interleaving and lock-step semantics for analysis and verification of GPU kernels

ESOP'13 Proceedings of the 22nd European conference on Programming Languages and Systems
GPUDet: a deterministic GPU architecture

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Cache-Conscious Wavefront Scheduling

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Warp size impact in GPUs: large or small?

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Efficient scheduling of recursive control flow on GPUs

Proceedings of the 27th international ACM conference on International conference on supercomputing
Cost-effective soft-error protection for SRAM-based structures in GPGPUs

Proceedings of the ACM International Conference on Computing Frontiers
Optimizing select conditions on GPUs

Proceedings of the Ninth International Workshop on Data Management on New Hardware
Future of GPGPU micro-architectural parameters

Proceedings of the Conference on Design, Automation and Test in Europe
Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation

Proceedings of the 40th Annual International Symposium on Computer Architecture
SIMD divergence optimization through intra-warp compaction

Proceedings of the 40th Annual International Symposium on Computer Architecture
GPUWattch: enabling energy optimizations in GPGPUs

Proceedings of the 40th Annual International Symposium on Computer Architecture
Designing on-chip networks for throughput accelerators

ACM Transactions on Architecture and Code Optimization (TACO)
Neither more nor less: optimizing thread-level parallelism for GPGPUs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Divergence analysis

ACM Transactions on Programming Languages and Systems (TOPLAS)
A locality-aware memory hierarchy for energy-efficient GPU architectures

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Divergence-aware warp scheduling

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Energy efficient GPU transactional memory via space-time optimizations

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Rhythm: harnessing data parallel hardware for server workloads

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
An Infrastructure for Tackling Input-Sensitivity of GPU Program Optimizations

International Journal of Parallel Programming
HARP: Harnessing inactive threads in many-core processors

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
A fast scalable implementation of the two-dimensional triangular Discrete Element Method on a GPU platform

Advances in Engineering Software

Quantified Score

Hi-index	0.00

Visualization

Abstract

Recent advances in graphics processing units (GPUs) have resulted in massively parallel hardware that is easily programmable and widely available in commodity desktop computer systems. GPUs typically use single-instruction, multiple-data (SIMD) pipelines to achieve high perfor- mance with minimal overhead incurred by control hard- ware. Scalar threads are grouped together into SIMD batches, sometimes referred to as warps. While SIMD is ideally suited for simple programs, recent GPUs include control flow instructions in the GPU instruction set archi- tecture and programs using these instructions may experi- ence reduced performance due to the way branch execution is supported by hardware. One approach is to add a stack to allow different SIMD processing elements to execute dis- tinct program paths after a branch instruction. The occur- rence of diverging branch outcomes for different processing elements significantly degrades performance. In this paper, we explore mechanisms for more efficient SIMD branch ex- ecution on GPUs. We show that a realistic hardware im- plementation that dynamically regroups threads into new warps on the fly following the occurrence of diverging branch outcomes improves performance by an average of 20.7% for an estimated area increase of 4.7%.