DRAIN: distributed recovery architecture for inaccessible nodes in multi-core chips

Authors:
Andrew DeOrio;Konstantinos Aisopos;Valeria Bertacco;Li-Shiuan Peh
Affiliations:
University of Michigan, Ann Arbor, MI;Princeton University, Princeton, NJ and Massachusetts Institute of Technology, Cambridge, MA;University of Michigan, Ann Arbor, MI;Massachusetts Institute of Technology, Cambridge, MA
Venue:
Proceedings of the 48th Design Automation Conference
Year:
2011

Abstract

As transistor dimensions continue to scale deep into the nanometer regime, silicon reliability is becoming a chief concern. At the same time, rising transistor counts enable highly integrated chips with many cores and a complex interconnect fabric, often a network-on-chip (NoC). Particularly problematic is the case in which accumulated permanent hardware faults disconnect cores from the rest of the system. To maintain correct system operation, the data held by these isolated nodes must be salvaged. In this work, we introduce a recovery mechanism targeting precisely this issue: DRAIN (Distributed Recovery Architecture for Inaccessible Nodes) provides system-level recovery from permanent failures. When an error disconnects a node from the network, DRAIN uses emergency links to transfer architectural state and cached data from the disconnected node to nearby connected caches. DRAIN incurs zero performance penalty during normal operation and is compatible with any cache coherence protocol, interconnect topology, or routing protocol. Experimental results show that DRAIN provides complete state recovery within several milliseconds, on average, for a 1 GHz 64-node CMP, at an area overhead of only a few thousand gates.
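
To make the notion of a "disconnected node" concrete, the following is a minimal illustrative sketch, not DRAIN's mechanism: it models the interconnect as a 2D mesh (an assumption; the abstract states that any topology is supported) and uses a breadth-first search to find nodes that can no longer reach the rest of the network once a set of links has failed permanently. The function names, the choice of node (0, 0) as the search root, and the 8x8 layout for the 64-node CMP are all hypothetical.

```python
from collections import deque


def mesh_links(width, height):
    """Return the set of undirected links in a width x height 2D mesh."""
    links = set()
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                links.add(frozenset({(x, y), (x + 1, y)}))
            if y + 1 < height:
                links.add(frozenset({(x, y), (x, y + 1)}))
    return links


def reachable_nodes(width, height, faulty_links, root=(0, 0)):
    """Breadth-first search from root over the links that are still healthy."""
    healthy = mesh_links(width, height) - {frozenset(l) for l in faulty_links}
    neighbors = {}
    for link in healthy:
        a, b = tuple(link)
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def disconnected_nodes(width, height, faulty_links, root=(0, 0)):
    """Nodes cut off from root; in this toy model, the ones needing recovery."""
    all_nodes = {(x, y) for x in range(width) for y in range(height)}
    return all_nodes - reachable_nodes(width, height, faulty_links, root)


# Hypothetical 8x8 mesh: both links of corner node (7, 7) have failed.
faults = [((7, 7), (6, 7)), ((7, 7), (7, 6))]
print(disconnected_nodes(8, 8, faults))  # → {(7, 7)}
```

In this toy model a single failed link never isolates a node, which matches the abstract's point that disconnection arises from the accumulation of permanent faults. How DRAIN itself detects isolation and selects the destination caches is not described in the abstract and is not modeled here.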