SP-NUCA: a cost effective dynamic non-uniform cache architecture

Authors:
Javier Merino;Valentín Puente;Pablo Prieto;José Ángel Gregorio
Affiliations:
Universidad de Cantabria, Spain;Universidad de Cantabria, Spain;Universidad de Cantabria, Spain;Universidad de Cantabria, Spain
Venue:
ACM SIGARCH Computer Architecture News
Year:
2008

Abstract

This paper presents a simple but effective method to reduce on-chip access latency and improve core isolation in CMP Non-Uniform Cache Architectures (NUCA). The paper introduces a feasible way to allocate cache blocks according to their access pattern. Each L2 bank is dynamically partitioned at set level into private and shared content. Simply by adjusting the replacement algorithm, private data can be placed closer to its owner processor. Shared data, in contrast, is always placed in the same position regardless of the accessing processor. This approach can reduce on-chip latency without significantly sacrificing hit rate or increasing implementation cost over a conventional static NUCA. Additionally, most of the unnecessary interference between cores in private accesses is removed. To support the architectural decisions adopted and provide a comparative study, a comprehensive evaluation framework is employed. The workbench is composed of a full-system simulator and a representative set of multithreaded and multiprogrammed workloads. With this infrastructure, different alternatives for the coherence protocol, replacement policies, and cache utilization are analyzed to find the optimal proposal. We conclude that the cost of a feasible implementation should be close to that of a conventional static NUCA and significantly lower than that of a dynamic NUCA. Finally, a comparison with static and dynamic NUCA is presented. The simulation results suggest that, on average, the proposed mechanism could improve system performance over a static NUCA and an idealized dynamic NUCA by 16% and 6%, respectively.
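
The placement idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the bank count, block size, associativity, the modulo bank mapping, the plain LRU policy, and every name (`shared_bank`, `private_bank`, `select_bank`, `CacheSet`) are assumptions chosen for illustration. The sketch only shows private blocks being mapped to a bank tied to the owning core, shared blocks being mapped to a bank fixed by the address, and one set holding both kinds of lines so that the private/shared split emerges from replacement.

```python
from collections import OrderedDict

# All constants below are illustrative assumptions, not values from the paper.
BLOCK_OFFSET_BITS = 6   # assumed 64-byte cache blocks
NUM_BANKS = 8           # assumed one L2 bank per core
WAYS = 4                # assumed associativity of each set


def shared_bank(addr: int) -> int:
    """Shared block: bank is fixed by address bits, identical for every core."""
    return (addr >> BLOCK_OFFSET_BITS) % NUM_BANKS


def private_bank(core_id: int) -> int:
    """Private block: bank assumed to sit next to the owning core."""
    return core_id % NUM_BANKS


def select_bank(addr: int, core_id: int, is_private: bool) -> int:
    """Pick the bank according to the block's private/shared status."""
    return private_bank(core_id) if is_private else shared_bank(addr)


class CacheSet:
    """One set whose lines carry a private/shared flag.

    Replacement here is plain LRU over all lines (an assumption), so the
    number of private versus shared ways changes with the access stream.
    """

    def __init__(self, ways: int = WAYS) -> None:
        self.ways = ways
        self.lines = OrderedDict()  # tag -> is_private, least recent first

    def access(self, tag: int, is_private: bool) -> bool:
        """Return True on a hit; on a miss, insert and evict the LRU line if full."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict least recently used line
        self.lines[tag] = is_private
        return False

    def private_ways(self) -> int:
        """Number of ways currently holding private blocks."""
        return sum(self.lines.values())
```

Under these assumptions, `select_bank(addr, core_id, False)` returns the same bank for every `core_id`, while `select_bank(addr, core_id, True)` depends only on the requesting core. The share of a set occupied by private lines, reported by `private_ways()`, is never configured explicitly and simply follows the accesses.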