Transient Fault Recovery on Chip Multiprocessor based on Dual Core Redundancy and Context Saving

  • Authors:
  • Rui Gong;Kui Dai;Zhiying Wang

  • Venue:
  • ICYCS '08 Proceedings of the 2008 The 9th International Conference for Young Computer Scientists
  • Year:
  • 2008

Abstract

To address the increasing susceptibility of microprocessors to transient faults, many techniques have been proposed that exploit the core redundancy of Chip Multiprocessors (CMPs). Chip-level Redundant Threading (CRT) is a novel approach that detects transient faults on CMPs by executing two copies of a given program on separate cores and comparing the store data. CRTR (CRT with Recovery) achieves fault recovery by comparing the result of every instruction before commit. When a nonidentical result is detected, the microprocessor can be recovered by re-executing from the faulty instruction. Because every instruction result must be compared, inter-core communication becomes critical in CRTR. To reduce the inter-core communication bandwidth demand, this paper proposes a new approach to fault recovery, Dual Core Redundancy with Context saving (DCR-C). DCR-C extends CRT by adding hardware-implemented context saving and recovery. In DCR-C, only store instructions are compared before commit, as in CRT, so the bandwidth demand is greatly reduced. Context saving is triggered by a cache miss caused by a store, so the context-saving latency can be hidden effectively. When a fault is detected, the processor is recovered to the saved context. The experimental results demonstrate that DCR-C is a preferable approach for achieving fault recovery with low performance overhead and low inter-core bandwidth demand.
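The recovery scheme described in the abstract can be illustrated with a small software model. The following is only a minimal sketch under stated assumptions, not the paper's hardware design: the three-instruction toy ISA (`li`, `add`, `st`), the `Core` class, the `run_dcr_c` function, the `miss_addrs` set that decides which stores miss in the cache, and the fault-injection tuple are all hypothetical names introduced for illustration. The toy ISA has no loads, so re-executing already committed stores after a rollback is harmless, and the model does not cover a latent fault being captured in the saved context.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Core:
    """Architectural context of one core: register file and program counter."""
    regs: dict = field(default_factory=dict)
    pc: int = 0


def step(core, program):
    """Execute one instruction; return (addr, value) for a store, else None."""
    op, *args = program[core.pc]
    out = None
    if op == "li":            # li rd, imm
        rd, imm = args
        core.regs[rd] = imm
    elif op == "add":         # add rd, rs1, rs2
        rd, rs1, rs2 = args
        core.regs[rd] = core.regs[rs1] + core.regs[rs2]
    elif op == "st":          # st addr, rs (not committed to memory here)
        addr, rs = args
        out = (addr, core.regs[rs])
    core.pc += 1
    return out


def run_dcr_c(program, miss_addrs=frozenset(), fault=None):
    """Run two redundant cores in lockstep, comparing stores only.

    fault = (pc, reg, bad_value) corrupts `reg` on the second core once,
    right after the instruction at index `pc` executes (hypothetical
    fault-injection hook, modelling a transient fault).
    """
    a, b = Core(), Core()
    memory = {}
    saved = Core()            # saved context; initially the reset state
    stats = {"compares": 0, "checkpoints": 0, "rollbacks": 0}
    while a.pc < len(program):
        pc = a.pc
        out_a, out_b = step(a, program), step(b, program)
        if fault is not None and fault[0] == pc:
            b.regs[fault[1]] = fault[2]
            fault = None      # transient: the fault does not recur
        if out_a is None:
            continue          # non-store results are never compared
        stats["compares"] += 1
        if out_a != out_b:
            # Mismatch on store data: restore both cores to the saved context.
            a, b = copy.deepcopy(saved), copy.deepcopy(saved)
            stats["rollbacks"] += 1
            continue
        addr, value = out_a
        memory[addr] = value  # commit the verified store
        if addr in miss_addrs:
            # A store-caused cache miss triggers context saving.
            saved = copy.deepcopy(a)
            stats["checkpoints"] += 1
    return memory, stats


PROGRAM = [
    ("li", "r1", 5),
    ("li", "r2", 7),
    ("add", "r3", "r1", "r2"),
    ("st", 0x10, "r3"),       # assumed to miss in the cache
    ("add", "r4", "r3", "r1"),
    ("st", 0x20, "r4"),
]

memory, stats = run_dcr_c(PROGRAM, miss_addrs={0x10}, fault=(4, "r4", 999))
print(memory, stats)
```

In this run the corrupted `r4` is caught when the second store is compared, both cores roll back to the context saved at the first store, and the two re-executed instructions then commit cleanly. Only store data crosses between the cores, which is the source of the bandwidth saving the abstract claims over comparing every instruction result.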