Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor

Authors:
Mark Gebhart;Stephen W. Keckler;Brucek Khailany;Ronny Krashinsky;William J. Dally
Affiliations:
-;-;-;-;-
Venue:
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2012

Citing 16
Cited 1

Selective cache ways: on-demand cache resource allocation

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Smart Memories: a modular reconfigurable architecture

Proceedings of the 27th annual international symposium on Computer architecture
Reconfigurable caches and their application to media processing

Proceedings of the 27th annual international symposium on Computer architecture
Automatically tuned linear algebra software

SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture

Proceedings of the 30th annual international symposium on Computer architecture
Resolving Register Bank Conflicts for a Network Processor

Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
ALP: Efficient support for all levels of parallelism for complex media applications

ACM Transactions on Architecture and Code Optimization (TACO)
A closer look at GPUs

Communications of the ACM
Rodinia: A benchmark suite for heterogeneous computing

IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Fast tridiagonal solvers on the GPU

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing

Proceedings of the 37th annual international symposium on Computer architecture
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Understanding the Energy Consumption of Dynamic Random Access Memories

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Register packing for cyclic reduction: a case study

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Energy-efficient mechanisms for managing thread context in throughput processors

Proceedings of the 38th annual international symposium on Computer architecture
A compile-time managed multi-level register file hierarchy

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

ARGO: aging-aware GPGPU register file allocation

Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis

Quantified Score

Hi-index	0.00

Visualization

Abstract

Modern throughput processors such as GPUs employ thousands of threads to drive high-bandwidth, long-latency memory systems. These threads require substantial on-chip storage for registers, cache, and scratchpad memory. Existing designs hard-partition this local storage, fixing the capacities of these structures at design time. We evaluate modern GPU workloads and find that they have widely varying capacity needs across these different functions. Therefore, we propose a unified local memory which can dynamically change the partitioning among registers, cache, and scratchpad on a per-application basis. The tuning that this flexibility enables improves both performance and energy consumption, and broadens the scope of applications that can be efficiently executed on GPUs. Compared to a hard-partitioned design, we show that unified local memory provides a performance benefit as high as 71% along with an energy reduction up to 33%.