A compile-time managed multi-level register file hierarchy

Authors:
Mark Gebhart;Stephen W. Keckler;William J. Dally
Affiliations:
The University of Texas at Austin, Austin, TX;The University of Texas at Austin, Austin, TX, and NVIDIA, Santa Clara, CA;NVIDIA, Santa Clara, CA, and Stanford University, Stanford, CA
Venue:
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2011

Abstract

As processors become increasingly power-limited, performance improvements will be achieved by rearchitecting systems with energy efficiency as the primary design constraint. While some of these optimizations will be hardware-based, combined hardware and software techniques will likely be the most productive. This work redesigns the register file system of a modern throughput processor with a combined hardware and software solution that reduces register file energy without harming system performance. Throughput processors use a large number of threads to tolerate latency, which requires a large, energy-intensive register file to store thread context. Our results show that a compiler-controlled register file hierarchy can reduce register file energy by up to 54%, compared to a hardware-only caching approach that reduces register file energy by 34%. We explore register allocation algorithms specifically targeted at improving energy efficiency by sharing temporary register file resources across concurrently running threads, and we conduct a detailed limit study on the further potential to optimize operand delivery for throughput processors. Our efficiency gains represent a direct performance gain for power-limited systems, such as GPUs.
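
To make the idea of a compiler-managed register file hierarchy concrete, the following is a minimal toy sketch, not the allocation algorithm from the paper. It assumes a two-level hierarchy (a small, cheap upper level in front of a large main register file), a simple greedy interval allocator, and made-up per-access energies in arbitrary units. The names `Value`, `allocate`, `energy`, `E_MAIN`, and `E_SMALL` are all hypothetical, and the numbers it produces have no relation to the 54% and 34% figures reported in the abstract.

```python
from dataclasses import dataclass

# Hypothetical per-access energies in arbitrary integer units (not from the paper).
E_MAIN = 5   # large main register file
E_SMALL = 1  # small upper-level register file


@dataclass
class Value:
    name: str
    start: int  # index of the instruction that writes the value
    end: int    # index of the instruction that reads it last
    reads: int  # total number of reads


def allocate(values, small_slots):
    """Greedy allocation: put a value in the small file if a slot is free
    for its whole live range, otherwise fall back to the main file.
    Assumption: a slot can be reused by a value written in the same
    instruction that performs the previous value's last read."""
    slot_free_at = [0] * small_slots  # first index at which each slot is reusable
    placement = {}
    for v in sorted(values, key=lambda v: v.start):
        for i, free_at in enumerate(slot_free_at):
            if free_at <= v.start:
                slot_free_at[i] = v.end
                placement[v.name] = "small"
                break
        else:
            placement[v.name] = "main"
    return placement


def energy(values, placement):
    """Total access energy: one write plus `reads` reads per value."""
    total = 0
    for v in values:
        cost = E_SMALL if placement[v.name] == "small" else E_MAIN
        total += cost * (1 + v.reads)
    return total


# Three short-lived temporaries and one long-lived accumulator.
values = [
    Value("t0", 0, 1, 1),
    Value("t1", 1, 2, 1),
    Value("acc", 0, 9, 3),
    Value("t2", 2, 3, 2),
]
placement = allocate(values, small_slots=1)
baseline = energy(values, {v.name: "main" for v in values})
print(placement)
print("hierarchy:", energy(values, placement), "baseline:", baseline)
```

In this toy input, the three short-lived temporaries share the single small slot one after another while the long-lived accumulator stays in the main file, which is the intuition behind keeping short-lived values out of the large, energy-intensive structure.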