Comparing the effects of intermittent and transient hardware faults on programs

  • Authors:
  • Jiesheng Wei; Layali Rashid; Karthik Pattabiraman; Sathish Gopalakrishnan

  • Affiliations:
  • Department of Electrical and Computer Engineering, The University of British Columbia, Canada (all four authors)

  • Venue:
  • DSNW '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops
  • Year:
  • 2011


Abstract

The trends of shrinking device geometries, lower voltages and higher frequencies in modern processors are expected to increase the rate of intermittent faults. This calls for software that is resilient to intermittent faults. There has been substantial research on software systems that are resilient to transient faults; however, it is unclear whether intermittent faults affect programs in the same way as transient faults. This question matters for deciding whether novel techniques are needed to tolerate intermittent faults in software. In this study, we attempt to answer it by comparing the effects of intermittent and transient hardware faults on programs through fault-injection experiments performed in a micro-architectural simulator for a simple five-stage pipelined processor. We also investigate whether the differences (if any) vary with the length (i.e., duration in cycles) of the fault and with the micro-architectural unit in which the fault originates. The results show that the impact of intermittent faults on programs is significantly different from that of transient faults, and that the difference depends both on the length of the fault and on its origin. Therefore, existing software techniques for ensuring resilience to transient faults may not be sufficient for intermittent faults, and new techniques are needed.
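To make the distinction between the two fault models concrete, the following is a toy Python sketch. It is not the authors' simulator or fault model: treating a transient fault as a one-cycle bit flip, treating an intermittent fault as a stuck-at value held for `length` cycles, and the names `Fault` and `run` are all illustrative assumptions. The "micro-architectural unit" here is just a register incremented once per cycle, whose possibly corrupted output feeds back into the next cycle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fault:
    """Hypothetical single-bit fault model (an illustrative assumption)."""
    bit: int            # bit position in the corrupted value
    start_cycle: int    # first cycle in which the fault is active
    length: int = 1     # duration in cycles; 1 models a transient fault
    kind: str = "flip"  # "flip", "stuck0" or "stuck1"

    def active(self, cycle: int) -> bool:
        return self.start_cycle <= cycle < self.start_cycle + self.length

    def apply(self, value: int, cycle: int) -> int:
        """Return the value seen in `cycle`, corrupted if the fault is active."""
        if not self.active(cycle):
            return value
        mask = 1 << self.bit
        if self.kind == "flip":
            return value ^ mask
        if self.kind == "stuck0":
            return value & ~mask
        if self.kind == "stuck1":
            return value | mask
        raise ValueError(f"unknown fault kind: {self.kind}")


def run(cycles: int, fault: Optional[Fault] = None) -> List[int]:
    """Toy unit: a register incremented once per cycle.

    The fault corrupts the register's output, and the corrupted value
    feeds back into the next cycle, so an error can propagate.
    """
    reg = 0
    trace = []
    for cycle in range(cycles):
        out = fault.apply(reg, cycle) if fault is not None else reg
        trace.append(out)
        reg = out + 1
    return trace


golden = run(8)
transient = run(8, Fault(bit=2, start_cycle=3, length=1, kind="flip"))
intermittent = run(8, Fault(bit=2, start_cycle=3, length=3, kind="stuck1"))
masked = run(8, Fault(bit=2, start_cycle=0, length=2, kind="stuck0"))

print(golden)            # [0, 1, 2, 3, 4, 5, 6, 7]
print(transient)         # [0, 1, 2, 7, 8, 9, 10, 11]
print(intermittent)      # [0, 1, 2, 7, 12, 13, 14, 15]
print(masked == golden)  # True
```

In this toy run, the one-cycle flip and the three-cycle stuck-at-1 fault on the same bit leave different final states, while a stuck-at-0 fault that is active only while the bit is already 0 is masked entirely. This merely illustrates why a fault's length and timing can matter; it does not reproduce the study's experiments or results.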