LIGERO: A light but efficient router conceived for cache-coherent chip multiprocessors

Authors:
Pablo Abad;Valentin Puente;Jose-Angel Gregorio
Affiliations:
University of Cantabria, Spain;University of Cantabria, Spain;University of Cantabria, Spain
Venue:
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Year:
2013

Citing 29
Cited 0

The Stanford Dash Multiprocessor

Computer
Route packets, not wires: on-chip inteconnection networks

Proceedings of the 38th annual Design Automation Conference
Interconnection Networks: An Engineering Approach

Interconnection Networks: An Engineering Approach
The Alpha 21364 Network Architecture

IEEE Micro
A Theory of Deadlock-Free Adaptive Multicast Routing in Wormhole Networks

IEEE Transactions on Parallel and Distributed Systems
A Progressive Approach to Handling Message-Dependent Deadlock in Parallel Computer Systems

IEEE Transactions on Parallel and Distributed Systems
The AMD Opteron Processor for Multiprocessor Servers

IEEE Micro
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Rotary router: an efficient architecture for CMP interconnection networks

Proceedings of the 34th annual international symposium on Computer architecture
Express virtual channels: towards the ideal interconnection fabric

Proceedings of the 34th annual international symposium on Computer architecture
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Token Coherence: A New Framework for Shared-Memory Multiprocessors

IEEE Micro
Token tenure: PATCHing token counting using directory-based cache coherence

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A case for bufferless routing in on-chip networks

Proceedings of the 36th annual international symposium on Computer architecture
Scaling the bandwidth wall: challenges in and avenues for CMP scaling

Proceedings of the 36th annual international symposium on Computer architecture
SCARAB: a single cycle adaptive routing and bufferless network

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Aérgia: exploiting packet latency slack in on-chip networks

Proceedings of the 37th annual international symposium on Computer architecture
Adaptive and deadlock-free tree-based multicast routing for networks-on-chip

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Power-efficient tree-based multicast support for networks-on-chip

Proceedings of the 16th Asia and South Pacific Design Automation Conference
Dynamic and Robust Streaming in and between Connected Consumer-Electronic Devices

Dynamic and Robust Streaming in and between Connected Consumer-Electronic Devices
Atomic Coherence: Leveraging nanophotonics to build race-free cache coherence protocols

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
CHIPPER: A low-complexity bufferless deflection router

HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Whole packet forwarding: Efficient design of fully adaptive routing algorithms for networks-on-chip

HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
TOPAZ: An Open-Source Interconnection Network Simulator for Chip Multiprocessors and Supercomputers

NOCS '12 Proceedings of the 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip
DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling

NOCS '12 Proceedings of the 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip
MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

NOCS '12 Proceedings of the 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip

Quantified Score

Hi-index	0.00

Visualization

Abstract

Although abstraction is the best approach to deal with computing system complexity, sometimes implementation details should be considered. Considering on-chip interconnection networks in particular, underestimating the underlying system specificity could have nonnegligible impact on performance, cost, or correctness. This article presents a very efficient router that has been devised to deal with cache-coherent chip multiprocessor particularities in a balanced way. Employing the same principles of packet rotation structures as in the rotary router, we present a router configuration with the following novel features: (1) reduced buffering requirements, (2) optimized pipeline under contentionless conditions, (3) more efficient deadlock avoidance mechanism, and (4) optimized in-order delivery guarantee. Putting it all together, our proposal provides a set of features that no other router, to the best of our knowledge, has achieved previously. These are: (1') low implementation cost, (2') low pass-through latency under low load, (3') improved resource utilization through adaptive routing and a buffering scheme free of head-of-line blocking, (4') guarantee of coherence protocol correctness via end-to-end deadlock avoidance and in-order delivery, and (5') improvement of coherence protocol responsiveness through adaptive in-network multicast support. We conduct a thorough evaluation that includes hardware cost estimation and performance evaluation under a wide spectrum of realistic workloads and coherence protocols. Comparing our proposal with VCTM, an optimized state-of-the-art wormhole router, it requires 50% less area, reduces on-chip cache hierarchy energy delay product on average by 20%, and improves the cache-coherency chip multiprocessor performance under realistic working conditions by up to 20%.