Energy-efficient mechanisms for managing thread context in throughput processors

  • Authors:
  • Mark Gebhart; Daniel R. Johnson; David Tarjan; Stephen W. Keckler; William J. Dally; Erik Lindholm; Kevin Skadron

  • Affiliations:
  • The University of Texas at Austin, Austin, TX, USA; University of Illinois at Urbana-Champaign, Urbana, IL, USA; NVIDIA, Santa Clara, CA, USA; NVIDIA / The University of Texas at Austin, Santa Clara, CA, USA; NVIDIA / Stanford University, Santa Clara, CA, USA; NVIDIA, Santa Clara, CA, USA; University of Virginia, Charlottesville, VA, USA

  • Venue:
  • Proceedings of the 38th annual international symposium on Computer architecture
  • Year:
  • 2011

Abstract

Modern graphics processing units (GPUs) use a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access in terms of both energy and latency. We present two complementary techniques for reducing energy on massively threaded processors such as GPUs. First, we examine register file caching to replace accesses to the large main register file with accesses to a smaller structure containing the immediate register working set of active threads. Second, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Combined with register file caching, a two-level thread scheduler provides a further reduction in energy by limiting the allocation of temporary register cache resources to only the currently active subset of threads. We show that on average, across a variety of real-world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59%, respectively. We further show that the active thread count can be reduced by a factor of 4 with minimal impact on performance, resulting in a 36% reduction of register file energy.
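
The register file caching idea in the abstract can be illustrated with a toy model. The following is a minimal sketch and not the authors' implementation: the class `RegisterFileCache`, the function `simulate`, the FIFO replacement policy, the choice to allocate entries only on writes, and the write-back on flush are all assumptions made for illustration. It counts how many accesses still reach the main register file (MRF) when one thread's instruction trace runs through a small per-thread cache, and it flushes the cache at the end to mimic a thread leaving the active set of a two-level scheduler.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Toy per-thread register file cache (hypothetical model, FIFO replacement)."""

    def __init__(self, entries=6):
        self.entries = entries
        self.cached = OrderedDict()  # register id -> True, kept in insertion (FIFO) order
        self.mrf_reads = 0           # reads that fall through to the main register file
        self.mrf_writes = 0          # write-backs to the main register file

    def read(self, reg):
        # Assumption: a read miss goes to the MRF and does not allocate an entry.
        if reg not in self.cached:
            self.mrf_reads += 1

    def write(self, reg):
        # Results are written into the cache; overwriting a cached register is free.
        if reg not in self.cached:
            if len(self.cached) >= self.entries:
                self.cached.popitem(last=False)  # evict the oldest entry
                self.mrf_writes += 1             # evicted value is written back
            self.cached[reg] = True

    def flush(self):
        # Assumption: when the thread leaves the active set, every cached
        # value is written back so the entries can be reused by another thread.
        self.mrf_writes += len(self.cached)
        self.cached.clear()


def simulate(trace, entries=6):
    """Run (dst, srcs) instructions through the cache; return (MRF reads, MRF writes)."""
    rfc = RegisterFileCache(entries)
    for dst, srcs in trace:
        for src in srcs:
            rfc.read(src)
        rfc.write(dst)
    rfc.flush()
    return rfc.mrf_reads, rfc.mrf_writes


if __name__ == "__main__":
    # Four instructions: r1 = f(r0); r2 = f(r1); r3 = f(r1, r2); r1 = f(r3)
    trace = [(1, (0,)), (2, (1,)), (3, (1, 2)), (1, (3,))]
    print(simulate(trace, entries=6))
```

Without a cache, this four-instruction trace performs 5 reads and 4 writes against the main register file. In the toy model with 6 entries, only the initial read of `r0` and the final write-backs reach it. The model ignores register liveness, so it writes back values that real hardware with liveness information might be able to discard.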