Vicis: a reliable network for unreliable silicon

Authors:
David Fick;Andrew DeOrio;Jin Hu;Valeria Bertacco;David Blaauw;Dennis Sylvester
Affiliations:
University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI
Venue:
Proceedings of the 46th Annual Design Automation Conference
Year:
2009

Abstract

Process scaling has given designers billions of transistors to work with. As feature sizes near the atomic scale, extensive variation and wearout inevitably make margining uneconomical or impossible. The ElastIC project seeks to address this by creating a large-scale chip multiprocessor that can self-diagnose, adapt, and heal. The repetitive structure of a network-on-chip (NoC) lends itself naturally to building large, flexible designs in this environment, but the loss of a single link or router will result in complete network failure. In this work we present Vicis, an ElastIC-style NoC that can tolerate the loss of many network components due to wearout-induced hard faults. Vicis uses the inherent redundancy in the network and its routers to maintain correct operation while incurring a much lower area overhead than previously proposed N-modular redundancy (NMR) based solutions. Each router has a built-in self-test (BIST) that diagnoses the locations of hard faults and runs a number of algorithms to make the best use of ECC, port swapping, and a crossbar bypass bus to mitigate them. The routers also work together to run distributed algorithms that solve network-wide problems, protecting the network against critical failures in individual routers. We show that with stuck-at fault rates as high as 1 in 2000 gates, Vicis continues to operate with approximately half of its routers still functional and communicating.