Proximity-aware directory-based coherence for multi-core processor architectures

Authors:
Jeffery A. Brown;Rakesh Kumar;Dean Tullsen
Affiliations:
University of California, San Diego, La Jolla, CA;University of Illinois at Urbana-Champaign, Urbana, IL;University of California, San Diego, La Jolla, CA
Venue:
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Year:
2007

Citing 17
Cited 14

The Stanford Dash Multiprocessor

Computer
The SGI Origin: a ccNUMA highly scalable server

Proceedings of the 24th annual international symposium on Computer architecture
An empirical evaluation of two memory-efficient directory methods

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Piranha: a scalable architecture based on single-chip multiprocessing

Proceedings of the 27th annual international symposium on Computer architecture
Cache-Only Memory Architectures

Computer
A Novel Approach to Reduce L2 Miss Latency in Shared-Memory Multiprocessors

IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA

HPCA '97 Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture
Design and Performance of Directory Caches for Scalable Shared Memory Multiprocessors

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Token coherence: decoupling performance and correctness

Proceedings of the 30th annual international symposium on Computer architecture
The optimum pipeline depth considering both power and performance

ACM Transactions on Architecture and Code Optimization (TACO)
Niagara: A 32-Way Multithreaded Sparc Processor

IEEE Micro
Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors

Proceedings of the 32nd annual international symposium on Computer Architecture
Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling

Proceedings of the 32nd annual international symposium on Computer Architecture
A NUCA substrate for flexible CMP cache sharing

Proceedings of the 19th annual international conference on Supercomputing
Cooperative Caching for Chip Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
In-Network Cache Coherence

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
RSIM: a simulator for shared-memory multiprocessor and uniprocessor systems that exploit ILP

WCAE-3 '97 Proceedings of the 1997 workshop on Computer architecture education

Utilizing shared data in chip multiprocessors with the Nahalal architecture

Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
Scalable and reliable communication for hardware transactional memory

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A Novel Cache Organization for Tiled Chip Multiprocessor

APPT '09 Proceedings of the 8th International Symposium on Advanced Parallel Processing Technologies
Impact of level-2 cache sharing on the performance and power requirements of homogeneous multicore embedded systems

Microprocessors & Microsystems
Compiler-based data classification for hybrid caching

Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
Proximity coherence for chip multiprocessors

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
SWEL: hardware cache coherence protocols to map shared data onto shared caches

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Energy- and Performance-Efficient Communication Framework for Embedded MPSoCs through Application-Driven Release Consistency

ACM Transactions on Design Automation of Electronic Systems (TODAES)
NoC-aware cache design for multithreaded execution on tiled chip multiprocessors

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Run-time energy management of manycore systems through reconfigurable interconnects

Proceedings of the 21st edition of the great lakes symposium on Great lakes symposium on VLSI
A modular simulator framework for network-on-chip based manycore chips using UNISIM

Transactions on High-Performance Embedded Architectures and Compilers IV
A hybrid NoC design for cache coherence optimization for chip multiprocessors

Proceedings of the 49th Annual Design Automation Conference
Practically private: enabling high performance CMPs through compiler-assisted data classification

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Predicting Coherence Communication by Tracking Synchronization Points at Run Time

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

As the number of cores increases on chip multiprocessors, coherence is fast becoming a central issue for multi-core performance. This is exacerbated by the fact that interconnection speeds are not scaling well with technology. This paper describes mechanisms to accelerate coherence for a multi-core architecture that has multiple private L2 caches and a scalable point-to-point interconnect between cores. These techniques exploit the differences in geometry between chip multiprocessors and traditional multiprocessor architectures. Directory-based protocols have been proposed as a scalable alternative to snoop-based protocols. In this paper, we discuss implementations of coherence for CMPs and propose and evaluate a novel directory-based coherence scheme to improve the performance of parallel programs on such processors. Proximity-aware coherence accelerates read and write misses by initiating cache-to-cache transfers from the spatially closest sharer. This has the dual benefit of eliminating unnecessary accesses to off-chip memory, and minimizing the distance over which communicated data moves across the network. The proposed schemes result in speedups up to 74.9% for our workloads.