The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Token coherence: decoupling performance and correctness
Proceedings of the 30th annual international symposium on Computer architecture
Immunet: A Cheap and Robust Fault-Tolerant Packet Routing Mechanism
Proceedings of the 31st annual international symposium on Computer architecture
Soft Errors in Advanced Computer Systems
IEEE Design & Test
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Vicis: a reliable network for unreliable silicon
Proceedings of the 46th Annual Design Automation Conference
IBM S/390 parallel enterprise server G5 fault tolerance: a historical perspective
IBM Journal of Research and Development
DRAIN: distributed recovery architecture for inaccessible nodes in multi-core chips
Proceedings of the 48th Design Automation Conference
Enabling system-level modeling of variation-induced faults in networks-on-chips
Proceedings of the 48th Design Automation Conference
ARIADNE: Agnostic Reconfiguration in a Disconnected Network Environment
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Error control schemes for on-chip communication links: the energy-reliability tradeoff
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
ForEVeR: A complementary formal and runtime verification approach to correct NoC functionality
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
uDIREC: unified diagnosis and reconfiguration for frugal bypass of NoC faults
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Hi-index | 0.00 |
Aggressive transistor scaling continues to increase integration capacity with each new technology node, but technology downscaling also increases the vulnerability of semiconductor devices and causes silicon failures. Thus, fault-tolerant architectures are emerging to guarantee reliable functionality on unreliable silicon. While tolerating faults within a processor core has been extensively researched, the many-core era introduces the challenge of reliable on-chip communication in Chip Multi-Processors (CMPs). In CMP systems, an unreliable interconnection network can lose or corrupt coherence messages, causing the entire chip to deadlock. In this work, we argue for a system-level resiliency solution to tolerate an unreliable underlying Network-on-Chip (NoC). We introduce a systematic methodology to transform a coherence protocol to a resilient one, by extending its Finite State Machine (FSM) with safe states and incorporating additional handshaking messages into transactions. The modified protocol ensures coherent and reliable transactions over any lossy NoC. Our approach is generic and can be applied to a wide range of protocols. It requires minimal hardware modifications and introduces only a slight performance overhead (an average of 0.8% during fault-free operation, and 1.9% even at an aggressive fault rate of one fault per msec).