Optimal Partitioning of Cache Memory
IEEE Transactions on Computers
Evaluating stream buffers as a secondary cache replacement
ISCA '94: Proceedings of the 21st Annual International Symposium on Computer Architecture
Selective cache ways: on-demand cache resource allocation
Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture
Managing multi-configuration hardware via dynamic working set analysis
ISCA '02: Proceedings of the 29th Annual International Symposium on Computer Architecture
Extending the reach of microprocessors: column and curious caching
Dynamic Partitioning of Shared Cache Memory
The Journal of Supercomputing
Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite
ACM-SE 42: Proceedings of the 42nd Annual Southeast Regional Conference
CQoS: a framework for enabling QoS in shared caches of CMP platforms
Proceedings of the 18th Annual International Conference on Supercomputing
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
A Case for MLP-Aware Cache Replacement
Proceedings of the 33rd Annual International Symposium on Computer Architecture
Cooperative Caching for Chip Multiprocessors
Proceedings of the 33rd Annual International Symposium on Computer Architecture
An analytical model for cache replacement policy performance
SIGMETRICS '06/Performance '06: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems
Architectural support for operating system-driven CMP cache management
Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques
Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
From chaos to QoS: case studies in CMP resource management
ACM SIGARCH Computer Architecture News
Adaptive insertion policies for high performance caching
Proceedings of the 34th Annual International Symposium on Computer Architecture
QoS policies and architecture for cache/memory in CMP platforms
Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
A Framework for Providing Quality of Service in Chip Multi-Processors
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Adaptive insertion policies for managing shared caches
Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques
High performance cache replacement using re-reference interval prediction (RRIP)
Proceedings of the 37th Annual International Symposium on Computer Architecture
A cache-partitioning aware replacement policy for chip multiprocessors
HiPC '06: Proceedings of the 13th International Conference on High Performance Computing
This paper addresses the problem of partitioning a cache among multiple concurrent threads in the presence of hardware prefetching. Cache replacement policies designed to preserve temporal locality (e.g., LRU) allocate cache resources in proportion to each competing thread's miss rate, irrespective of whether the thread actually reuses that space [Qureshi and Patt 2006]. This is clearly suboptimal, as applications vary dramatically in their reuse of recently accessed data. We address this problem by partitioning a shared cache so that a global goodness metric is optimized. The paper introduces the Gradient-based Cache Partitioning Algorithm (GPA), whose variants optimize hit rate, total instructions per cycle (IPC), or a weighted IPC metric designed to enforce Quality of Service (QoS) [Iyer 2004]. In the QoS setting, GPA maximizes the throughput of low-priority threads while ensuring high performance for high-priority threads. The GPA mechanism is robust and low-cost, integrates easily with existing cache designs, and improves the throughput of an in-order 8-core system sharing an 8MB L3 cache by ∼14%.
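To make the gradient-ascent idea concrete, the following is a minimal sketch in C of one partitioning step, not the paper's actual hardware: all names (Thread, gpa_step, NUM_THREADS, TOTAL_WAYS, DELTA, the sampled metric fields) are hypothetical. The sketch assumes the mechanism can observe the global goodness metric (hit rate, IPC, or weighted IPC) with a thread's way allocation perturbed by plus or minus DELTA; each step then moves capacity from the thread with the smallest estimated marginal utility to the thread with the largest.

/* Hypothetical sketch of one gradient-based partitioning step.
 * All structures and constants are illustrative assumptions. */
#include <stdio.h>

#define NUM_THREADS 8
#define TOTAL_WAYS  64  /* assumed associativity of the shared 8MB L3 */
#define DELTA       1   /* probe size, in ways */

/* Per-thread state: current way allocation plus the global goodness
 * metric sampled while this thread's share was perturbed by +DELTA
 * and -DELTA ways. */
typedef struct {
    int    ways;
    double metric_plus;
    double metric_minus;
} Thread;

/* One step: estimate each thread's marginal utility as a finite-
 * difference gradient of the global metric, then shift DELTA ways
 * from the lowest-gradient thread to the highest-gradient one. */
static void gpa_step(Thread t[]) {
    int best = 0, worst = 0;
    for (int i = 1; i < NUM_THREADS; i++) {
        double g  = (t[i].metric_plus - t[i].metric_minus) / (2.0 * DELTA);
        double gb = (t[best].metric_plus - t[best].metric_minus) / (2.0 * DELTA);
        double gw = (t[worst].metric_plus - t[worst].metric_minus) / (2.0 * DELTA);
        if (g > gb) best = i;
        if (g < gw) worst = i;
    }
    if (best != worst && t[worst].ways > DELTA) {
        t[worst].ways -= DELTA;   /* donor: low marginal utility */
        t[best].ways  += DELTA;   /* recipient: high marginal utility */
    }
}

int main(void) {
    /* Start from an equal partition and apply one step with
     * placeholder metric samples. */
    Thread t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        t[i].ways = TOTAL_WAYS / NUM_THREADS;
        t[i].metric_plus  = 0.50 + 0.01 * i;  /* made-up measurements */
        t[i].metric_minus = 0.50;
    }
    gpa_step(t);
    for (int i = 0; i < NUM_THREADS; i++)
        printf("thread %d: %d ways\n", i, t[i].ways);
    return 0;
}

Applied repeatedly, a step like this performs hill climbing on the global metric; under the weighted-IPC variant, one would presumably scale each thread's contribution to the metric by its priority weight, which is what yields the QoS behavior the abstract describes.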