Efficient Hardware Hashing Functions for High Performance Computers. IEEE Transactions on Computers.
An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems.
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques.
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. In Proceedings of the 32nd Annual International Symposium on Computer Architecture.
The M5 Simulator: Modeling Networked Systems. IEEE Micro.
ASR: Adaptive Selective Replication for CMP Caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture.
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture.
Adaptive Insertion Policies for High Performance Caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture.
Adaptive Insertion Policies for Managing Shared Caches. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques.
PIPP: Promotion/Insertion Pseudo-Partitioning of Multi-Core Shared Caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture.
Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture.
Cache Sharing Management for Performance Fairness in Chip Multiprocessors. In Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques (PACT '09).
Evaluation Techniques for Storage Hierarchies. IBM Systems Journal.
Does Cache Sharing on Modern CMP Matter to the Performance of Contemporary Multithreaded Programs? In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
High Performance Cache Replacement Using Re-Reference Interval Prediction (RRIP). In Proceedings of the 37th Annual International Symposium on Computer Architecture.
Using Dead Blocks as a Virtual Victim Cache. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques.
In Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '43).
NUcache: An Efficient Multicore Cache Organization Based on Next-Use Distance. In Proceedings of the 17th IEEE International Symposium on High Performance Computer Architecture (HPCA '11).
SHiP: Signature-Based Hit Predictor for High Performance Caching. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.
Shared last-level caches (SLLCs) on chip multiprocessors play an important role in bridging the performance gap between processing cores and main memory. Although many proposals target the weaknesses of the least-recently-used (LRU) replacement policy by optimizing either locality or utility for heterogeneous workloads, very few are suitable for practical SLLC designs because of their large overhead of log2(associativity) bits per cache line for re-reference interval prediction. Two recently proposed practical replacement policies, TA-DRRIP and SHiP, significantly reduce this overhead by relying on just 2 bits per line for prediction, but they manage locality only, missing the opportunity offered by utility optimization. This paper is motivated by two key experimental observations: (i) the not-recently-used (NRU) replacement policy, which needs only one bit per line for prediction, approximates LRU performance satisfactorily; and (ii) since locality and utility optimization opportunities are concurrently present in heterogeneous workloads, co-optimizing both is indispensable to higher performance yet missing from existing practical SLLC schemes. We therefore propose COOP, a practical SLLC design that needs just one bit per line for re-reference interval prediction and leverages lightweight per-core locality and utility monitors that profile sampled SLLC sets to guide the co-optimization. On a quad-core CMP with a 4MB SLLC running 200 random workloads, COOP improves throughput over LRU by 7.67% at a storage overhead of 17.74KB, outperforming both recent practical policies at an in-between cost (TA-DRRIP: 4.53% improvement with 16KB storage; SHiP: 6.00% improvement with 25.75KB storage).
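To make observation (i) concrete, the following is a minimal sketch of NRU replacement within a single cache set — not the paper's COOP design, just the textbook one-bit-per-line policy the abstract refers to. The class and method names (`NRUSet`, `access`) are illustrative, not from the paper.

```python
class NRUSet:
    """One NRU (not-recently-used) bit per line: a 1-bit approximation of LRU.

    nru == 0 means "recently used"; nru == 1 marks an eviction candidate.
    Illustrative sketch only; names are hypothetical, not from the paper.
    """

    def __init__(self, ways):
        self.tags = [None] * ways  # block tags; None = invalid line
        self.nru = [1] * ways      # all lines start as eviction candidates

    def access(self, tag):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if tag in self.tags:
            self.nru[self.tags.index(tag)] = 0  # hit: mark recently used
            return True
        victim = self._find_victim()
        self.tags[victim] = tag
        self.nru[victim] = 0                    # newly filled line is recently used
        return False

    def _find_victim(self):
        # Prefer an invalid line; otherwise take the first line with nru == 1.
        # If every line was recently used, reset all NRU bits and retry --
        # this global reset is where NRU loses precision relative to LRU.
        for i, t in enumerate(self.tags):
            if t is None:
                return i
        while True:
            for i, bit in enumerate(self.nru):
                if bit == 1:
                    return i
            self.nru = [1] * len(self.nru)
```

Note the storage contrast the abstract draws: true LRU needs log2(associativity) bits per line to order the lines, while this policy keeps a single bit and tolerates occasional non-LRU victim choices after a reset.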