CHIPPER: A low-complexity bufferless deflection router

Authors:
Chris Fallin;Chris Craik;Onur Mutlu
Affiliations:
Computer Architecture Lab (CALCM), Carnegie Mellon University;Computer Architecture Lab (CALCM), Carnegie Mellon University;Computer Architecture Lab (CALCM), Carnegie Mellon University
Venue:
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Year:
2011

Citing 0
Cited 12

Making-a-stop: A new bufferless routing algorithm for on-chip network

Journal of Parallel and Distributed Computing
Benefits of selective packet discard in networks-on-chip

ACM Transactions on Architecture and Code Optimization (TACO)
On-chip networks from a networking perspective: congestion and scalability in many-core interconnects

Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication
On-chip networks from a networking perspective: congestion and scalability in many-core interconnects

ACM SIGCOMM Computer Communication Review - Special october issue SIGCOMM '12
LIGERO: A light but efficient router conceived for cache-coherent chip multiprocessors

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
DeBAR: deflection based adaptive router with minimal buffering

Proceedings of the Conference on Design, Automation and Test in Europe
A heterogeneous multiple network-on-chip design: an application-aware approach

Proceedings of the 50th Annual Design Automation Conference
Deflection routing in 3D network-on-chip with limited vertical bandwidth

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special Section on Networks on Chip: Architecture, Tools, and Methodologies
Scalable high-radix router microarchitecture using a network switch organization

ACM Transactions on Architecture and Code Optimization (TACO)
Traffic steering between a low-latency unswitched TL ring and a high-throughput switched on-chip interconnect

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
TornadoNoC: A lightweight and scalable on-chip network architecture for the many-core era

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

As Chip Multiprocessors (CMPs) scale to tens or hundreds of nodes, the interconnect becomes a significant factor in cost, energy consumption and performance. Recent work has explored many design tradeoffs for networks-on-chip (NoCs) with novel router architectures to reduce hardware cost. In particular, recent work proposes bufferless deflection routing to eliminate router buffers. The high cost of buffers makes this choice potentially appealing, especially for low-to-medium network loads. However, current bufferless designs usually add complexity to control logic. Deflection routing introduces a sequential dependence in port allocation, yielding a slow critical path. Explicit mechanisms are required for livelock freedom due to the non-minimal nature of deflection. Finally, deflection routing can fragment packets, and the reassembly buffers require large worst-case sizing to avoid deadlock, due to the lack of network backpressure. The complexity that arises out of these three problems has discouraged practical adoption of bufferless routing. To counter this, we propose CHIPPER (Cheap-Interconnect Partially Permuting Router), a simplified router microarchitecture that eliminates in-router buffers and the crossbar. We introduce three key insights: first, that deflection routing port allocation maps naturally to a permutation network within the router; second, that livelock freedom requires only an implicit token-passing scheme, eliminating expensive age-based priorities; and finally, that flow control can provide correctness in the absence of network backpressure, avoiding deadlock and allowing cache miss buffers (MSHRs) to be used as reassembly buffers. Using multiprogrammed SPEC CPU2006, server, and desktop application workloads and SPLASH-2 multithreaded workloads, we achieve an average 54.9% network power reduction for 13.6% average performance degradation (multipro-grammed) and 73.4% power reduction for 1.9% slowdown (multithreaded), with minimal degradation and large power savings at low-to-medium load. Finally, we show 36.2% router area reduction relative to buffered routing, with comparable timing.