Energy-efficient mechanisms for managing thread context in throughput processors

Authors:
Mark Gebhart;Daniel R. Johnson;David Tarjan;Stephen W. Keckler;William J. Dally;Erik Lindholm;Kevin Skadron
Affiliations:
The University of Texas at Austin, Austin, TX, USA;University of Illinois at Urbana-Champaign, Urbana, IL, USA;NVIDIA, Santa Clara, CA, USA;NVIDIA / The University of Texas at Austin, Santa Clara, CA, USA;NVIDIA / Stanford University, Santa Clara, CA, USA;NVIDIA, Santa Clara, CA, USA;University of Virginia, Charlottesville, VA, USA
Venue:
Proceedings of the 38th annual international symposium on Computer architecture
Year:
2011

Abstract

Modern graphics processing units (GPUs) use a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access in terms of both energy and latency. We present two complementary techniques for reducing energy on massively threaded processors such as GPUs. First, we examine register file caching to replace accesses to the large main register file with accesses to a smaller structure containing the immediate register working set of active threads. Second, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Combined with register file caching, a two-level thread scheduler provides a further reduction in energy by limiting the allocation of temporary register cache resources to only the currently active subset of threads. We show that on average, across a variety of real-world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59%, respectively. We further show that the active thread count can be reduced by a factor of 4 with minimal impact on performance, resulting in a 36% reduction of register file energy.
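
To make the register file caching idea concrete, the following is a minimal Python sketch of a per-thread cache sitting in front of the main register file. It is an illustration under stated assumptions, not the design evaluated in the paper: the FIFO replacement, the choice not to allocate on read misses, the write-back of every evicted or flushed entry, and all names (`RegisterFileCache`, `run`, `trace`) are hypothetical. The instruction trace is invented for the example and is not one of the paper's workloads.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Toy per-thread register file cache (RFC) in front of a main register file (MRF)."""

    def __init__(self, entries=6):
        self.entries = entries
        self.cache = OrderedDict()  # cached register names, oldest first (FIFO order)
        self.mrf_reads = 0
        self.mrf_writes = 0

    def read(self, reg):
        # A hit is served by the RFC. A miss goes to the MRF and,
        # by assumption, does not allocate an RFC entry.
        if reg not in self.cache:
            self.mrf_reads += 1

    def write(self, reg):
        # Results are written into the RFC. Overwriting a register that is
        # already cached costs no MRF access.
        if reg in self.cache:
            return
        if len(self.cache) >= self.entries:
            self.cache.popitem(last=False)  # evict the oldest entry
            self.mrf_writes += 1            # write the evicted value back to the MRF
        self.cache[reg] = True

    def flush(self):
        # Write back everything still cached, e.g. when the thread leaves
        # the active set (assumption: every cached value is still needed).
        self.mrf_writes += len(self.cache)
        self.cache.clear()


def run(trace, entries):
    """Replay (destination, sources) pairs and return (MRF reads, MRF writes)."""
    rfc = RegisterFileCache(entries)
    for dst, srcs in trace:
        for src in srcs:
            rfc.read(src)
        rfc.write(dst)
    rfc.flush()
    return rfc.mrf_reads, rfc.mrf_writes


# Hypothetical five-instruction trace: (destination register, source registers).
trace = [
    ("r1", ["r0"]),
    ("r2", ["r1"]),
    ("r3", ["r1", "r2"]),
    ("r4", ["r3"]),
    ("r1", ["r4", "r0"]),
]

baseline_reads = sum(len(srcs) for _, srcs in trace)  # every operand read hits the MRF
baseline_writes = len(trace)                          # every result is written to the MRF
print("no cache:", (baseline_reads, baseline_writes))
print("6-entry cache:", run(trace, 6))
print("2-entry cache:", run(trace, 2))
```

On this toy trace, most source operands were produced a few instructions earlier, so they are served by the small cache and only reads of `r0`, which is never written in the trace, reach the main register file. The counts say nothing about the 50% and 59% figures reported above, which come from the authors' own workloads and design.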