Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection

  • Authors:
  • Cheng Wang;Ho-seop Kim;Youfeng Wu;Victor Ying

  • Affiliations:
  • Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation

  • Venue:
  • Proceedings of the International Symposium on Code Generation and Optimization
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

As transistors become increasingly smaller and faster with tighter noise margins, modern processors are becoming increasingly more susceptible to transient hardware faults. Existing Hardware-based Redundant Multi-Threading (HRMT) approaches rely mostly on special-purpose hardware to replicate the program into redundant execution threads and compare their computation results. In this paper, we present a Software-based Redundant Multi-Threading (SRMT) approach for transient fault detection. Our SRMT technique uses compiler to automatically generate redundant threads so they can run on general-purpose chip multi-processors (CMPs). We exploit high-level program information available at compile time to optimize data communication between redundant threads. Furthermore, our software-based technique provides flexible program execution environment where the legacy binary codes and the reliability-enhanced codes can co-exist in a mix-and-match fashion, depending on the desired level of reliability and software compatibility. Our experimental results show that compiler analysis and optimization techniques can reduce data communication requirement by up to 88% of HRMT. With general-purpose intra-chip communication mechanisms in CMP machine, SRMT overhead can be as low as 19%. Moreover, SRMT technique achieves error coverage rates of 99.98% and 99.6% for SPEC CPU2000 integer and floating-point benchmarks, respectively. These results demonstrate the competitiveness of SRMT to HRMT approaches.