Fault-Tolerant Wormhole Routing in Meshes without Virtual Channels
IEEE Transactions on Parallel and Distributed Systems
The Reliable Router: A Reliable and High-Performance Communication Substrate for Parallel Computers
PCRCW '94 Proceedings of the First International Workshop on Parallel Computer Routing and Communication
Low Power Error Resilient Encoding for On-Chip Data Buses
Proceedings of the conference on Design, automation and test in Europe
A Fault-Tolerant and Deadlock-Free Routing Protocol in 2D Meshes Based on Odd-Even Turn Model
IEEE Transactions on Computers
A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip
Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
A New Approach to Fault-Tolerant Wormhole Routing for Mesh-Connected Parallel Computers
IEEE Transactions on Computers
Immunet: A Cheap and Robust Fault-Tolerant Packet Routing Mechanism
Proceedings of the 31st annual international symposium on Computer architecture
Multi-phase minimal fault-tolerant wormhole routing in meshes
Parallel Computing
Microarchitecture and Design Challenges for Gigascale Integration
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori
IEEE Computer Architecture Letters
A survey of research and practices of Network-on-chip
ACM Computing Surveys (CSUR)
Exploring Fault-Tolerant Network-on-Chip Architectures
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Reliability modeling and management in dynamic microprocessor-based systems
Proceedings of the 43rd annual Design Automation Conference
ElastIC: An Adaptive Self-Healing Architecture for Unpredictable Silicon
IEEE Design & Test
Proceedings of the conference on Design, automation and test in Europe
Efficient unicast and multicast support for CMPs
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
A highly resilient routing algorithm for fault-tolerant NoCs
Proceedings of the Conference on Design, Automation and Test in Europe
Design techniques for cross-layer resilience
Proceedings of the Conference on Design, Automation and Test in Europe
Cost-effective slack allocation for lifetime improvement in NoC-based MPSoCs
Proceedings of the Conference on Design, Automation and Test in Europe
A resilient on-chip router design through data path salvaging
Proceedings of the 16th Asia and South Pacific Design Automation Conference
Exploiting inherent information redundancy to manage transient errors in NoC routing arbitration
NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip
Energy and reliability oriented mapping for regular Networks-on-Chip
NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip
DRAIN: distributed recovery architecture for inaccessible nodes in multi-core chips
Proceedings of the 48th Design Automation Conference
Enabling system-level modeling of variation-induced faults in networks-on-chips
Proceedings of the 48th Design Automation Conference
ROBUST: a new self-healing fault-tolerant NoC router
Proceedings of the 4th International Workshop on Network on Chip Architectures
Optimizing built-in pseudo-random self-testing for network-on-chip switches
Proceedings of the 2012 Interconnection Network Architecture: On-Chip, Multi-Chip Workshop
A highly robust distributed fault-tolerant routing algorithm for NoCs with localized rerouting
Proceedings of the 2012 Interconnection Network Architecture: On-Chip, Multi-Chip Workshop
A systematic methodology to develop resilient cache coherence protocols
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Viper: virtual pipelines for enhanced reliability
Proceedings of the 39th Annual International Symposium on Computer Architecture
Structural Test and Diagnosis for Graceful Degradation of NoC Switches
Journal of Electronic Testing: Theory and Applications
A survey and taxonomy of on-chip monitoring of multicore systems-on-chip
ACM Transactions on Design Automation of Electronic Systems (TODAES)
NoCAlert: An On-Line and Real-Time Fault Detection Mechanism for Network-on-Chip Architectures
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Addressing network-on-chip router transient errors with inherent information redundancy
ACM Transactions on Embedded Computing Systems (TECS) - Special Section on Wireless Health Systems, On-Chip and Off-Chip Network Architectures
A complete self-testing and self-configuring NoC infrastructure for cost-effective MPSoCs
ACM Transactions on Embedded Computing Systems (TECS) - Special Section on Wireless Health Systems, On-Chip and Off-Chip Network Architectures
A fault tolerant NoC architecture using quad-spare mesh topology and dynamic reconfiguration
Journal of Systems Architecture: the EUROMICRO Journal
Journal of Electronic Testing: Theory and Applications
Methods for fault tolerance in networks-on-chip
ACM Computing Surveys (CSUR)
Cost-effective lifetime and yield optimization for NoC-based MPSoCs
ACM Transactions on Design Automation of Electronic Systems (TODAES)
uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Online traffic-aware fault detection for networks-on-chip
Journal of Parallel and Distributed Computing
Hi-index | 0.00 |
Process scaling has given designers billions of transistors to work with. As feature sizes near the atomic scale, extensive variation and wearout inevitably make margining uneconomical or impossible. The ElastIC project seeks to address this by creating a large-scale chip-multiprocessor that can self-diagnose, adapt, and heal. Creating large, flexible designs in this environment naturally lends itself to the repetitive nature of network-on-chip (NoC), but the loss of a single link or router will result in complete network failure. In this work we present Vicis, an ElastIC-style NoC that can tolerate the loss of many network components due to wearout induced hard faults. Vicis uses the inherent redundancy in the network and its routers in order to maintain correct operation while incurring a much lower area overhead than previously proposed N-modular redundancy (NMR) based solutions. Each router has a built-in-self-test (BIST) that diagnoses the locations of hard fault and runs a number of algorithms to best use ECC, port swapping, and a crossbar bypass bus to mitigate them. The routers work together to run distributed algorithms to solve network-wide problems as well, protecting the networking against critical failures in individual routers. In this work we show that with stuck-at fault rates as high as 1 in 2000 gates, Vicis will continue to operate with approximately half of its routers still functional and communicating.