A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

Authors:
Mark Gebhart;Daniel R. Johnson;David Tarjan;Stephen W. Keckler;William J. Dally;Erik Lindholm;Kevin Skadron
Affiliations:
The University of Texas at Austin;University of Illinois at Urbana-Champaign;NVIDIA;NVIDIA and The University of Texas at Austin;NVIDIA and Stanford University;NVIDIA;University of Virginia
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
2012

Abstract

Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access in terms of both energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler’s energy efficiency. Second, we propose replacing the monolithic register file found on modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy, including the number of levels in the hierarchy and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by only allocating entries in the upper levels of the register file hierarchy for active threads. Averaged across a variety of real-world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance, and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.
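The two-level scheduling idea in the abstract can be illustrated with a toy simulation. The following is a minimal sketch under stated assumptions, not the paper's implementation: the class names (`Thread`, `TwoLevelScheduler`), the round-robin issue policy, the two-opcode instruction model (`"alu"` and `"mem"`), and the fixed memory latency are all hypothetical choices made for illustration. Each cycle the scheduler considers only the small active set. A thread that issues a long-latency memory operation is demoted to the pending set, and a pending thread whose access has returned is promoted into the freed slot.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Thread:
    tid: int
    program: deque      # remaining ops: "alu" (short) or "mem" (long latency); assumed non-empty
    ready_at: int = 0   # cycle at which an outstanding memory access returns


class TwoLevelScheduler:
    """Toy model: a small active set plus a larger pending set (hypothetical design)."""

    def __init__(self, threads, active_slots, mem_latency):
        self.active = deque()
        self.pending = deque(threads)
        self.active_slots = active_slots
        self.mem_latency = mem_latency
        self.cycle = 0
        self.issued = []          # trace of (cycle, tid, op)
        self.max_considered = 0   # largest set the scheduler examined in any cycle

    def _promote(self):
        # Fill free active slots with pending threads whose memory access has returned.
        for _ in range(len(self.pending)):
            if len(self.active) >= self.active_slots:
                break
            t = self.pending.popleft()
            if t.ready_at <= self.cycle:
                self.active.append(t)
            else:
                self.pending.append(t)

    def step(self):
        self._promote()
        self.max_considered = max(self.max_considered, len(self.active))
        if self.active:
            t = self.active.popleft()          # round-robin over active threads only
            op = t.program.popleft()
            self.issued.append((self.cycle, t.tid, op))
            if t.program:
                if op == "mem":
                    # Long-latency access: demote so another thread can use the slot.
                    t.ready_at = self.cycle + self.mem_latency
                    self.pending.append(t)
                else:
                    self.active.append(t)
        self.cycle += 1

    def run(self):
        while self.active or self.pending:
            self.step()
        return self.issued


if __name__ == "__main__":
    threads = [Thread(i, deque(["alu", "mem", "alu"])) for i in range(8)]
    sched = TwoLevelScheduler(threads, active_slots=2, mem_latency=10)
    trace = sched.run()
    print(len(trace), "instructions issued;",
          "at most", sched.max_considered, "threads considered per cycle")
```

The 8-thread, 2-slot configuration is chosen only to mirror the factor-of-4 reduction in active thread count mentioned above; the sketch models neither energy nor the register file hierarchy.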