Modern graphics processing units (GPUs) use a large number of hardware threads to hide both functional unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access in terms of both energy and latency. We present two complementary techniques for reducing energy on massively threaded processors such as GPUs. First, we examine register file caching, which replaces accesses to the large main register file with accesses to a smaller structure containing the immediate register working set of the active threads. Second, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Combined with register file caching, a two-level thread scheduler provides a further reduction in energy by limiting the allocation of temporary register cache resources to only the currently active subset of threads. We show that on average, across a variety of real-world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59%, respectively. We further show that the active thread count can be reduced by a factor of 4 with minimal impact on performance, resulting in a 36% reduction of register file energy.
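To make the register file caching idea concrete, the following is a minimal sketch of a per-thread register file cache that counts how many main register file accesses remain. It is not the implementation evaluated in the paper. The FIFO replacement policy, write allocation, the choice not to allocate on a read miss, the write-back of every entry on flush, and the toy instruction trace are all assumptions made for illustration, and the class and function names (`RegisterFileCache`, `run`, `TRACE`) are hypothetical.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Toy per-thread register file cache (RFC).

    Assumptions (not taken from the abstract): FIFO replacement,
    results are always written into the RFC, read misses do not
    allocate, and every cached value is written back when evicted
    or flushed.
    """

    def __init__(self, entries=6):
        self.entries = entries
        self.cache = OrderedDict()  # register name -> True, oldest first
        self.mrf_reads = 0          # reads served by the main register file
        self.mrf_writes = 0         # writes reaching the main register file

    def read(self, reg):
        # A hit is served by the RFC; a miss reads the main register file.
        if reg not in self.cache:
            self.mrf_reads += 1

    def write(self, reg):
        # Overwriting a cached register keeps its FIFO position.
        if reg not in self.cache and len(self.cache) >= self.entries:
            self.cache.popitem(last=False)  # evict the oldest entry
            self.mrf_writes += 1            # write the evicted value back
        self.cache[reg] = True

    def flush(self):
        # Models the thread leaving the active set: write back everything.
        self.mrf_writes += len(self.cache)
        self.cache.clear()


def run(trace, entries=6):
    """Replay a trace of (dst, [srcs]) instructions through one RFC."""
    rfc = RegisterFileCache(entries)
    for dst, srcs in trace:
        for src in srcs:
            rfc.read(src)
        rfc.write(dst)
    rfc.flush()
    return rfc


# Hypothetical four-instruction trace for a single thread.
TRACE = [
    ("r1", ["r0"]),
    ("r2", ["r1", "r0"]),
    ("r3", ["r2", "r1"]),
    ("r1", ["r3", "r2"]),
]

baseline_reads = sum(len(srcs) for _, srcs in TRACE)
baseline_writes = len(TRACE)
rfc = run(TRACE, entries=6)
```

Comparing `rfc.mrf_reads` and `rfc.mrf_writes` against `baseline_reads` and `baseline_writes` shows the access reduction for this toy trace only; the numbers say nothing about the workloads measured in the paper. The `flush` call is where a two-level scheduler would matter in this model: only threads in the active set hold cache entries, so a thread moved to the pending set gives its entries up.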