Improving coherence protocol reactiveness by trading bandwidth for latency

Authors:
Lucia G. Menezo;Valentin Puente;Pablo Abad;Jose Angel Gregorio
Affiliations:
University of Cantabria, Santander, Spain;University of Cantabria, Santander, Spain;University of Cantabria, Santander, Spain;University of Cantabria, Santander, Spain
Venue:
Proceedings of the 9th conference on Computing Frontiers
Year:
2012

Abstract

This paper describes how the particular characteristics of on-chip networks can be used to improve coherence protocol responsiveness. To achieve this, a new coherence protocol, named LOCKE, is proposed. LOCKE exploits the large bandwidth available on chip to improve the performance and energy efficiency of cache-coherent chip multiprocessors (CMPs). Provided that the interconnection network is designed to support multicast traffic and the protocol maximizes the potential advantages of direct coherence, we demonstrate that a multicast-based coherence protocol can reduce energy requirements in the CMP memory hierarchy. The key idea is to establish a suitable level of on-chip network throughput to accelerate synchronization by two means: avoiding the serialization inherent to directory-based coherence protocols, and reducing average access time more than other snoop-based coherence protocols do when shared data is truly contended. LOCKE is developed on top of a Token Coherence performance substrate, with a new set of simple proactive policies that speed up data synchronization and eliminate the passive token starvation avoidance mechanism. Evaluated on a full-system simulator that faithfully models the on-chip interconnection, an aggressive core architecture, and precise memory hierarchy details, running a broad spectrum of workloads, our proposal improves on both directory-based and token-based coherence protocols in terms of both energy and performance, at least in systems with up to 16 aggressive out-of-order processors on the chip.
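
The bandwidth-for-latency trade named in the title can be illustrated with a toy hop-count model. The sketch below is not the LOCKE protocol. It assumes a 4x4 mesh with one tile per core, Manhattan-distance routing, a directory miss that travels requester to home to owner and back to the requester, and a multicast miss that delivers one request copy to every other tile. The function names and tile positions are hypothetical, chosen only for illustration.

```python
# Toy model of the bandwidth-for-latency trade (illustrative assumption,
# not the protocol evaluated in the paper).
from itertools import product

MESH_DIM = 4  # assumed 4x4 mesh, i.e. 16 tiles


def hops(a, b):
    """Manhattan distance between two tiles on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def directory_miss(requester, home, owner):
    """Hops on the critical path and messages sent for a directory-indirected miss."""
    latency = hops(requester, home) + hops(home, owner) + hops(owner, requester)
    messages = 3  # request, forwarded request, data response
    return latency, messages


def multicast_miss(requester, owner, tiles):
    """Hops on the critical path and messages delivered for a multicast miss."""
    latency = hops(requester, owner) + hops(owner, requester)
    # One request copy delivered to every other tile, plus the data response.
    messages = (len(tiles) - 1) + 1
    return latency, messages


tiles = list(product(range(MESH_DIM), repeat=2))
requester, home, owner = (0, 0), (3, 3), (0, 1)  # hypothetical placement

print(directory_miss(requester, home, owner))   # (12, 3)
print(multicast_miss(requester, owner, tiles))  # (2, 16)
```

With this placement the owner sits next to the requester while the home tile is at the far corner, so the directory path pays for the detour through the home node. The multicast path reaches the owner directly and spends more messages to do so. Real behavior depends on router design, contention, and in-network multicast support, none of which this sketch models.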