Cooperative Caching for Chip Multiprocessors

Authors:
Jichuan Chang;Gurindar S. Sohi
Affiliations:
University of Wisconsin-Madison;University of Wisconsin-Madison
Venue:
Proceedings of the 33rd annual international symposium on Computer Architecture
Year:
2006

Citing 33
Cited 78

On the inclusion properties for multi-level cache hierarchies

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
A cache coherence approach for large multiprocessor systems

ICS '88 Proceedings of the 2nd international conference on Supercomputing
DDM: A Cache-Only Memory Architecture

Computer
Implementing global memory management in a workstation cluster

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Evaluation of design alternatives for a multiprocessor microprocessor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Performance isolation: sharing and isolation in shared-memory multiprocessors

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Summary cache: a scalable wide-area web cache sharing protocol

IEEE/ACM Transactions on Networking (TON)
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Simics: A Full System Simulation Platform

Computer
Simulating a $2M Commercial Server on a $2K PC

Computer
Exploring the Design Space of Future CMPs

Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
A low-overhead coherence solution for multiprocessors with private cache memories

ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
An argument for simple COMA

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
A Shared-bus Control Mechanism and a Cache Coherence Protocol for a High-performance On-chip Multiprocessor

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Token coherence: decoupling performance and correctness

Proceedings of the 30th annual international symposium on Computer architecture
Distance Associativity for High-Performance Energy-Efficient Non-Uniform Cache Architectures

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
CQoS: a framework for enabling QoS in shared caches of CMP platforms

Proceedings of the 18th annual international conference on Supercomputing
Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Managing Wire Delay in Large Chip-Multiprocessor Caches

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Improving Multiple-CMP Systems Using Token Coherence

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Niagara: A 32-Way Multithreaded Sparc Processor

IEEE Micro
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors

Proceedings of the 32nd annual international symposium on Computer Architecture
Optimizing Replication, Communication, and Capacity Allocation in CMPs

Proceedings of the 32nd annual international symposium on Computer Architecture
The V-Way Cache: Demand Based Associativity via Global Replacement

Proceedings of the 32nd annual international symposium on Computer Architecture
Organizing the Last Line of Defense before Hitting the Memory Wall for CMPs

HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
Fast and fair: data-stream quality of service

Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
A NUCA substrate for flexible CMP cache sharing

Proceedings of the 19th annual international conference on Supercomputing
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research

IEEE Computer Architecture Letters
Cooperative caching: using remote client memory to improve file system performance

OSDI '94 Proceedings of the 1st USENIX conference on Operating Systems Design and Implementation
High-throughput coherence control and hardware messaging in everest

IBM Journal of Research and Development
POWER4 system microarchitecture

IBM Journal of Research and Development

Computation spreading: employing hardware migration to specialize CMP cores on-the-fly

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
A flexible data to L2 cache mapping approach for future multicore processors

Proceedings of the 2006 workshop on Memory system performance and correctness
Coherence Ordering for Ring-based Chip Multiprocessors

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
ASR: Adaptive Selective Replication for CMP Caches

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
CMP cache performance projection: accessibility vs. capacity

ACM SIGARCH Computer Architecture News
Proximity-aware directory-based coherence for multi-core processor architectures

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Virtual hierarchies to support server consolidation

Proceedings of the 34th annual international symposium on Computer architecture
Interconnect design considerations for large NUCA caches

Proceedings of the 34th annual international symposium on Computer architecture
Cooperative cache partitioning for chip multiprocessors

Proceedings of the 21st annual international conference on Supercomputing
A reusability-aware cache memory sharing technique for high-performance low-power CMPs with private L2 caches

ISLPED '07 Proceedings of the 2007 international symposium on Low power electronics and design
Adaptive set pinning: managing shared caches in chip multiprocessors

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Visions for application development on hybrid computing systems

Parallel Computing
Utilizing shared data in chip multiprocessors with the Nahalal architecture

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
A consistency architecture for hierarchical shared caches

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
SP-NUCA: a cost effective dynamic non-uniform cache architecture

ACM SIGARCH Computer Architecture News
Towards hybrid last level caches for chip-multiprocessors

ACM SIGARCH Computer Architecture News
A novel migration-based NUCA design for chip multiprocessors

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
On the performance benefits of sharing and privatizing second and third-level cache memories in homogeneous multi-core architectures

Microprocessors & Microsystems
Distributed cooperative caching

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Improving support for locality and fine-grain sharing in chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Leveraging on-chip networks for data cache migration in chip multiprocessors

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
A leakage-aware cache sharing technique for low-power chip multi-processors (CMPs) with private L2 caches

Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Dynamic cache clustering for chip multiprocessors

Proceedings of the 23rd international conference on Supercomputing
Push-assisted migration of real-time tasks in multi-core processors

Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Multi-execution: multicore caching for data-similar executions

Proceedings of the 36th annual international symposium on Computer architecture
Reactive NUCA: near-optimal block placement and replication in distributed caches

Proceedings of the 36th annual international symposium on Computer architecture
Triplet-based topology for on-chip networks

WSEAS Transactions on Computers
A Novel Cache Organization for Tiled Chip Multiprocessor

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
L1 Collective Cache: Managing Shared Data for Chip Multiprocessors

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
Last Bank: Dealing with Address Reuse in Non-Uniform Cache Architecture for CMPs

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Performance balancing: software-based on-chip memory management for effective CMP executions

Proceedings of the 10th workshop on MEmory performance: DEaling with Applications, systems and architecture
Reusability-aware cache memory sharing for chip multiprocessors with private L2 caches

Journal of Systems Architecture: the EUROMICRO Journal
Chameleon: Virtualizing idle acceleration cores of a heterogeneous multicore processor for caching and prefetching

ACM Transactions on Architecture and Code Optimization (TACO)
An analysis of on-chip interconnection networks for large-scale chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO)
Compiler-based data classification for hybrid caching

Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
Two-phase trace-driven simulation (TPTS): a fast multicore processor architecture simulation approach

Software—Practice & Experience
Direct coherence: bringing together performance and scalability in shared-memory multiprocessors

HiPC'07 Proceedings of the 14th international conference on High performance computing
Performance and energy trade-offs analysis of L2 on-chip cache architectures for embedded MPSoCs

Proceedings of the 20th symposium on Great lakes symposium on VLSI
Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling

ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Adaptive L2 cache for chip multiprocessors

Euro-Par'07 Proceedings of the 2007 conference on Parallel processing
Cache topology aware computation mapping for multicores

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Software data spreading: leveraging distributed caches to improve single thread performance

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
The auction: optimizing banks usage in Non-Uniform Cache Architectures

Proceedings of the 24th ACM International Conference on Supercomputing
Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors

Proceedings of the 37th annual international symposium on Computer architecture
Replication-aware leakage management in chip multiprocessors with private L2 cache

Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design
Handling the problems and opportunities posed by multiple on-chip memory controllers

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
ATAC: a 1000-core cache-coherent processor with on-chip optical network

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
NoC-aware cache design for chip multiprocessors

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Group-caching for NoC based multicore cache coherent systems

Proceedings of the Conference on Design, Automation and Test in Europe
Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
The Sharing Tracker: Using Ideas from Cache Coherence Hardware to Reduce Off-Chip Memory Traffic with Non-Coherent Caches

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Power-efficient spilling techniques for chip multiprocessors

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Thread owned block cache: managing latency in many-core architecture

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Architectural support for thread communications in multi-core processors

Parallel Computing
Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Minimal Multi-threading: Finding and Removing Redundant Instructions in Multi-threaded Processors

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
NoC-aware cache design for multithreaded execution on tiled chip multiprocessors

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Inter-core prefetching for multicore processors using migrating helper threads

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Memory-, bandwidth-, and power-aware multi-core for a graph database workload

ARCS'11 Proceedings of the 24th international conference on Architecture of computing systems
Design and management of 3D-stacked NUCA cache for chip multiprocessors

Proceedings of the 21st edition of the great lakes symposium on Great lakes symposium on VLSI
Research note: C-AMTE: A location mechanism for flexible cache management in chip multiprocessors

Journal of Parallel and Distributed Computing
Evaluating placement policies for managing capacity sharing in CMP architectures with private caches

ACM Transactions on Architecture and Code Optimization (TACO)
A cache-partitioning aware replacement policy for chip multiprocessors

HiPC'06 Proceedings of the 13th international conference on High Performance Computing
DAPSCO: Distance-aware partially shared cache organization

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
The gradient-based cache partitioning algorithm

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Throttling capacity sharing in private L2 caches of CMPs

Proceedings of the 2011 ACM Symposium on Research in Applied Computation
Evaluating Dynamics and Bottlenecks of Memory Collaboration in Cluster Systems

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Practically private: enabling high performance CMPs through compiler-assisted data classification

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
A collaborative memory system for high-performance and cost-effective clustered architectures

Proceedings of the 1st Workshop on Architectures and Systems for Big Data
A survey on cache tuning from a power/energy perspective

ACM Computing Surveys (CSUR)
Dynamic directories: a mechanism for reducing on-chip interconnect power in multicores

DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Jigsaw: scalable software-defined caches

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Exploiting replication to improve performances of NUCA-based CMP systems

ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents CMP Cooperative Caching, a unified framework to manage a CMP's aggregate on-chip cache resources. Cooperative caching combines the strengths of private and shared cache organizations by forming an aggregate "shared" cache through cooperation among private caches. Locally active data are attracted to the private caches by their accessing processors to reduce remote on-chip references, while globally active data are cooperatively identified and kept in the aggregate cache to reduce off-chip accesses. Examples of cooperation include cache-to-cache transfers of clean data, replication-aware data replacement, and global replacement of inactive data. These policies can be implemented by modifying an existing cache replacement policy and cache coherence protocol, or by the new implementation of a directory-based protocol presented in this paper. Our evaluation using full-system simulation shows that cooperative caching achieves an off-chip miss rate similar to that of a shared cache, and a local cache hit rate similar to that of using private caches. Cooperative caching performs robustly over a range of system/cache sizes and memory latencies. For an 8-core CMP with 1MB L2 cache per core, the best cooperative caching scheme improves the performance of multithreaded commercial workloads by 5-11% compared with a shared cache and 4-38% compared with private caches. For a 4-core CMP running multiprogrammed SPEC2000 workloads, cooperative caching is on average 11% and 6% faster than shared and private cache organizations, respectively.