Decoupled dynamic cache segmentation

Authors:
Samira M. Khan; Zhe Wang; Daniel A. Jimenez
Affiliations:
Department of Computer Science, The University of Texas at San Antonio; Department of Computer Science, The University of Texas at San Antonio; Department of Computer Science, The University of Texas at San Antonio
Venue:
HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
Year:
2012

Abstract

The least recently used (LRU) replacement policy performs poorly in the last-level cache (LLC) because the first- and second-level caches filter out the temporal locality of memory accesses. We propose a cache segmentation technique that dynamically adapts to cache access patterns by predicting the best number of not-yet-referenced and already-referenced blocks in the cache. This technique is independent of the LRU policy, so it can work with less expensive replacement policies. It can automatically detect when to bypass blocks to the CPU with no extra overhead. In a single-core processor with a 2MB LLC running a memory-intensive subset of the SPEC CPU 2006 benchmarks, it outperforms LRU replacement on average by 5.2% with not-recently-used (NRU) replacement and on average by 2.2% with random replacement. The technique also complements existing shared cache partitioning techniques. Our evaluation with 10 multi-programmed workloads shows that this technique improves performance of an 8MB LLC four-core system on average by 12%, with a random replacement policy requiring only half the space of the LRU policy.
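
The segmentation idea in the abstract can be illustrated with a toy model of a single cache set. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name SegmentedSet, the parameter target_nonref, the eviction rule, and the choice to model bypass as a target of zero are all hypothetical. The abstract says the segment sizes are predicted dynamically; here the target is a fixed number supplied by the caller, and the predictor itself is not modeled. Victims within a segment are chosen at random, in the spirit of the random replacement configuration the abstract mentions.

```python
import random


class SegmentedSet:
    """Toy model of one cache set split into two logical segments:
    not-yet-referenced blocks and already-referenced blocks.
    All names and rules here are illustrative assumptions."""

    def __init__(self, ways, target_nonref, seed=0):
        self.ways = ways
        # Hypothetical knob: desired number of not-yet-referenced blocks.
        # The paper predicts this dynamically; this sketch keeps it fixed.
        self.target_nonref = target_nonref
        # Maps tag -> True if the block was re-referenced after its fill.
        self.blocks = {}
        self.rng = random.Random(seed)

    def access(self, tag):
        """Return 'hit', 'miss', or 'bypass' for an access to tag."""
        if tag in self.blocks:
            # A hit moves the block into the already-referenced segment.
            self.blocks[tag] = True
            return "hit"
        if self.target_nonref == 0:
            # Assumption: no room is granted to new blocks, so skip the fill.
            return "bypass"
        if len(self.blocks) >= self.ways:
            self._evict()
        self.blocks[tag] = False  # new blocks start as not-yet-referenced
        return "miss"

    def _evict(self):
        nonref = [t for t, r in self.blocks.items() if not r]
        ref = [t for t, r in self.blocks.items() if r]
        # Evict from the not-yet-referenced segment when it is at or above
        # its target size (or when it is the only non-empty segment);
        # otherwise shrink the already-referenced segment.
        if nonref and (len(nonref) >= self.target_nonref or not ref):
            pool = nonref
        else:
            pool = ref
        del self.blocks[self.rng.choice(pool)]


if __name__ == "__main__":
    s = SegmentedSet(ways=4, target_nonref=1)
    for tag in ["a", "a", "b", "b", "c", "c"]:
        s.access(tag)                # a, b, c become already-referenced
    for tag in ["s1", "s2", "s3"]:
        s.access(tag)                # a streaming scan of single-use blocks
    print(sorted(s.blocks))
```

In this sketch, a scan of single-use blocks only ever displaces other not-yet-referenced blocks, so the reused blocks a, b, and c stay resident. Raising target_nonref gives new blocks more room at the expense of reused ones, which is the trade-off a dynamic predictor would tune.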