Adaptive Cache Memories for SMT Processors
DSD '10 Proceedings of the 2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools
Resizable caches can trade off capacity for access speed to dynamically match the needs of the workload. In single-threaded cores, resizable caches have been shown to improve processor performance by adapting to the phases of the running application. In Simultaneous Multi-Threaded (SMT) cores, caching needs can vary greatly with the number of threads and their characteristics, offering even more opportunities to dynamically adjust cache resources to the workload. In this paper, we demonstrate that the preferred control methodology for data cache reconfiguration in an SMT core changes as the number of running threads increases. In workloads with one or two threads, the resizable cache control algorithm should optimize for cache miss behavior, because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies are closely related mathematically: the former minimizes the arithmetic mean cache access time (which we call AMAT), while the latter minimizes its harmonic mean. We introduce an algorithm (HAMAT) that smoothly and naturally adjusts between the two strategies as the degree of multi-threading grows. We extend a previously proposed Globally Asynchronous, Locally Synchronous (GALS) processor core with SMT support and dynamically resizable caches. We show that the HAMAT algorithm significantly outperforms the AMAT algorithm on four-thread workloads while matching its performance on one- and two-thread workloads. Moreover, HAMAT achieves overall performance improvements of 18.7%, 10.1%, and 14.2% on one-, two-, and four-thread workloads, respectively, over the best fixed-configuration cache design.
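To illustrate the mathematical relationship claimed above, consider a sketch in our own notation (the symbols t_i, N, and c are not taken from the paper): let t_1(c), ..., t_N(c) be the latencies of the N data cache accesses observed under a candidate cache configuration c. The two control objectives can then be written as

  \mathrm{AMAT}(c) = \frac{1}{N} \sum_{i=1}^{N} t_i(c)   \quad \text{(arithmetic mean access time)}

  \mathrm{HM}(c) = \frac{N}{\sum_{i=1}^{N} 1 / t_i(c)}   \quad \text{(harmonic mean access time)}

The arithmetic mean is dominated by the few long miss latencies, so minimizing AMAT(c) favors configurations with fewer misses, which suits one- and two-thread workloads where misses lie on the critical path. The harmonic mean is dominated by the many short hit latencies, so minimizing HM(c) favors configurations with faster hits, which matters more when additional threads can hide miss latency.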