Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels

Authors:
Michael Steffen; Joseph Zambreno
Affiliations:
-;-
Venue:
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2010

Abstract

Wide Single Instruction, Multiple Thread (SIMT) architectures often require a static allocation of thread groups that are executed in lockstep throughout the entire application kernel. Individual thread branching is supported by executing all control flow paths for threads in a thread group and only committing the results of threads on the current control path. While convergence algorithms are used to maximize processor efficiency during branching operations, applications requiring complex control flow often result in low processor efficiency due to the length and quantity of control paths. Global rendering algorithms are an example of a class of application that can be accelerated using a large number of independent parallel threads that each require complex control flow, resulting in comparatively low efficiency on SIMT processors. To improve processor utilization for global rendering algorithms, we introduce a SIMT architecture that allows threads to be created dynamically at runtime. Large application kernels are broken down into smaller code blocks, which we call µ-kernels, that dynamically created threads can execute. These runtime µ-kernels allow for the removal of branching statements that would cause divergence within a thread group, and result in new threads being created and grouped with threads beginning execution of the same µ-kernel. In our evaluation of SIMT processor efficiency for global rendering algorithms, dynamic µ-kernels improved processor performance by an average of 1.4×.
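
The effect described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' architecture or simulator. It assumes that every code block costs one issue slot, that regrouping is free, and that threads advance one block per step. The warp width, the block labels (`traverse`, `intersect`, `shade`), and all function names are hypothetical and chosen only for illustration. It compares lane utilization when thread groups are fixed against utilization when threads entering the same code block are regrouped.

```python
from collections import Counter
import random

WARP_WIDTH = 8  # hypothetical warp width for this toy model


def make_paths(num_threads, seed=0):
    """Give each thread a random sequence of code-block labels (its control path)."""
    rng = random.Random(seed)
    blocks = ["traverse", "intersect", "shade"]  # illustrative labels only
    return [[rng.choice(blocks) for _ in range(rng.randint(2, 6))]
            for _ in range(num_threads)]


def static_efficiency(paths, width=WARP_WIDTH):
    """Fixed thread groups: at each step a warp issues once per distinct block
    its live threads need, and only the matching lanes do useful work."""
    useful = issued = 0
    for start in range(0, len(paths), width):
        warp = paths[start:start + width]
        for step in range(max(len(p) for p in warp)):
            wanted = Counter(p[step] for p in warp if step < len(p))
            issued += len(wanted) * width   # one full-width issue per block
            useful += sum(wanted.values())  # lanes that commit results
    return useful / issued


def dynamic_efficiency(paths, width=WARP_WIDTH):
    """Regrouping: at each step, threads starting the same block are pooled
    across all threads and packed into warps of the given width."""
    useful = issued = 0
    for step in range(max(len(p) for p in paths)):
        wanted = Counter(p[step] for p in paths if step < len(p))
        for count in wanted.values():
            warps = -(-count // width)      # ceiling division
            issued += warps * width
            useful += count
    return useful / issued


if __name__ == "__main__":
    paths = make_paths(256)
    print("static  lane utilization:", round(static_efficiency(paths), 3))
    print("dynamic lane utilization:", round(dynamic_efficiency(paths), 3))
```

As a hand-checkable case, 16 threads that alternate between two single-block paths give a utilization of 0.5 with fixed groups of 8 (each warp issues twice with half its lanes idle) and 1.0 after regrouping (two full warps). The model ignores thread-creation overhead and memory effects, so these ratios say nothing about the 1.4× figure reported above.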