Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors

Authors:
Karin Strauss;Xiaowei Shen;Josep Torrellas
Affiliations:
-;-;-
Venue:
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2007

Citing 0
Cited 14

Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture

Distributed cooperative caching
Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Token tenure: PATCHing token counting using directory-based cache coherence
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture

Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture

Push-assisted migration of real-time tasks in multi-core processors
Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems

In-network coherence filtering: snoopy coherence without broadcasts
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture

Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors
Proceedings of the 37th annual international symposium on Computer architecture

Token tenure and PATCH: A predictive/adaptive token-counting hybrid
ACM Transactions on Architecture and Code Optimization (TACO)

A composite and scalable cache coherence protocol for large scale CMPs
Proceedings of the international conference on Supercomputing

Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture

Improving coherence protocol reactiveness by trading bandwidth for latency
Proceedings of the 9th conference on Computing Frontiers

A hybrid NoC design for cache coherence optimization for chip multiprocessors
Proceedings of the 49th Annual Design Automation Conference

LIGERO: A light but efficient router conceived for cache-coherent chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers

Dual partitioning multicasting for high-performance on-chip networks
Journal of Parallel and Distributed Computing

Abstract

Snoopy cache coherence can be implemented in any physical network topology by embedding a logical unidirectional ring in the network. Control messages are forwarded using the ring, while other messages can use any path. While the resulting coherence protocols are inexpensive to implement, they enable many ways of overlapping multiple transactions that access the same line, making it hard to reason about correctness. Moreover, snoop requests are required to traverse the ring, therefore lengthening coherence transaction latencies. In this paper, we address these problems and make two main contributions. First, we introduce the Ordering invariant, which ensures the correct serialization of colliding transactions in embedded-ring protocols. Second, based on this invariant, we remove the requirement that snoop requests traverse the ring. Instead, they are delivered using any network path, as long as snoop responses, which are typically off the critical path, use the logical ring. This approach substantially reduces coherence transaction latency. We call the resulting protocol Uncorq. We show that, on a 64-node Chip Multiprocessor (CMP), Uncorq improves the performance, on average, by 23% for SPLASH-2 applications and by 10% for commercial applications. With an additional simple prefetching optimization, the performance improvement is, on average, 26% for SPLASH-2 applications and 18% for commercial applications.
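
The latency argument in the abstract can be illustrated with a rough hop-count sketch. It is not taken from the paper. It assumes an 8x8 mesh as the physical network, one particular way of embedding a unidirectional ring in it (a Hamiltonian cycle), and shortest-path (Manhattan) routing for requests that leave the ring. The names `embedded_ring`, `manhattan`, and `avg_request_hops` are hypothetical. The sketch counts only the hops a snoop request needs to reach each node; it ignores snoop time, contention, and the snoop responses that still travel on the ring, so it says nothing about the performance figures reported above.

```python
def embedded_ring(rows, cols):
    """Return a Hamiltonian cycle over a rows x cols mesh as a list of (row, col).

    Assumed embedding (not from the paper): row 0 left to right, then a snake
    through columns 1..cols-1 of the remaining rows, then back up column 0.
    Requires an even number of rows so the snake ends next to column 0.
    """
    if rows % 2 != 0 or rows < 2 or cols < 2:
        raise ValueError("this construction needs an even row count and cols >= 2")
    ring = [(0, c) for c in range(cols)]
    for r in range(1, rows):
        # Odd rows run right to left, even rows left to right, skipping column 0.
        cs = range(cols - 1, 0, -1) if r % 2 == 1 else range(1, cols)
        ring += [(r, c) for c in cs]
    ring += [(r, 0) for r in range(rows - 1, 0, -1)]
    return ring


def manhattan(a, b):
    """Shortest-path hop count between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def avg_request_hops(rows, cols, requester_index=0):
    """Average hops for a snoop request to reach each other node.

    Returns (ring_avg, direct_avg): delivery along the unidirectional ring
    versus delivery over a shortest mesh path.
    """
    ring = embedded_ring(rows, cols)
    n = len(ring)
    src = ring[requester_index]
    # The node k positions downstream on the ring sees the request after k hops.
    ring_total = sum(range(1, n))
    direct_total = sum(
        manhattan(src, ring[(requester_index + k) % n]) for k in range(1, n)
    )
    return ring_total / (n - 1), direct_total / (n - 1)


if __name__ == "__main__":
    ring_avg, direct_avg = avg_request_hops(8, 8)
    print(f"ring delivery:   {ring_avg:.2f} hops on average")
    print(f"direct delivery: {direct_avg:.2f} hops on average")
```

Under these assumptions, the ring distance grows linearly with the number of nodes, while the shortest-path distance grows with the mesh dimensions. That is the intuition behind letting snoop requests take any network path while responses stay on the ring.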