Selective cache ways: on-demand cache resource allocation
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Smart Memories: a modular reconfigurable architecture
Proceedings of the 27th annual international symposium on Computer architecture
Reconfigurable caches and their application to media processing
Proceedings of the 27th annual international symposium on Computer architecture
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture
Proceedings of the 30th annual international symposium on Computer architecture
Resolving Register Bank Conflicts for a Network Processor
Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
ALP: Efficient support for all levels of parallelism for complex media applications
ACM Transactions on Architecture and Code Optimization (TACO)
Communications of the ACM
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09 Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
Fast tridiagonal solvers on the GPU
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Resistive computation: avoiding the power wall with low-leakage, STT-MRAM based computing
Proceedings of the 37th annual international symposium on Computer architecture
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Understanding the Energy Consumption of Dynamic Random Access Memories
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Register packing for cyclic reduction: a case study
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
Energy-efficient mechanisms for managing thread context in throughput processors
Proceedings of the 38th annual international symposium on Computer architecture
A compile-time managed multi-level register file hierarchy
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
ARGO: aging-aware GPGPU register file allocation
Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis
Hi-index | 0.00 |
Modern throughput processors such as GPUs employ thousands of threads to drive high-bandwidth, long-latency memory systems. These threads require substantial on-chip storage for registers, cache, and scratchpad memory. Existing designs hard-partition this local storage, fixing the capacities of these structures at design time. We evaluate modern GPU workloads and find that they have widely varying capacity needs across these different functions. Therefore, we propose a unified local memory which can dynamically change the partitioning among registers, cache, and scratchpad on a per-application basis. The tuning that this flexibility enables improves both performance and energy consumption, and broadens the scope of applications that can be efficiently executed on GPUs. Compared to a hard-partitioned design, we show that unified local memory provides a performance benefit as high as 71% along with an energy reduction up to 33%.