Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor

  • Authors:
  • Christopher Weaver; Joel Emer; Shubhendu S. Mukherjee; Steven K. Reinhardt

  • Affiliations:
  • Intel Corporation, Hudson MA; Intel Corporation, Hudson MA; Intel Corporation, Hudson MA; Intel Corporation, Hudson MA / University of Michigan, Ann Arbor

  • Venue:
  • Proceedings of the 31st Annual International Symposium on Computer Architecture
  • Year:
  • 2004

Abstract

Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor counts in future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a device makes that device more likely to encounter a fault. Hence, maintaining processor error rates at acceptable levels will require increasing design effort.

This paper proposes two simple approaches to reduce error rates and evaluates their application to a microprocessor instruction queue. The first technique reduces the time instructions sit in vulnerable storage structures by selectively squashing instructions when long delays are encountered. A fault is less likely to cause an error if the structure it affects does not contain valid instructions. We introduce a new metric, MITF (Mean Instructions To Failure), to capture the trade-off between performance and reliability introduced by this approach.

The second technique addresses false detected errors. In the absence of a fault detection mechanism, such errors would not have affected the final outcome of a program. For example, a fault affecting the result of a dynamically dead instruction would not change the final program output, but could still be flagged by the hardware as an error. To avoid signaling such false errors, we modify a pipeline's error detection logic to mark affected instructions and data as possibly incorrect rather than immediately signaling an error. Then, we signal an error only if we determine later that the possibly incorrect value could have affected the program's output.
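
The abstract does not spell out how MITF is computed. Reading the name literally, as the expected number of instructions committed between errors, one plausible formulation is throughput (IPC times clock frequency) multiplied by mean time to failure. The sketch below assumes that formulation; the function names and all numbers are hypothetical and serve only to illustrate the trade-off: squashing may cost some IPC, yet it pays off if the gain in time to failure is proportionally larger.

```python
def mitf(ipc: float, frequency_hz: float, mttf_seconds: float) -> float:
    """Mean Instructions To Failure, assuming MITF = IPC * frequency * MTTF.

    This formula is an assumption based on the metric's name,
    not a definition quoted from the abstract.
    """
    return ipc * frequency_hz * mttf_seconds


def mitf_gain(base: tuple, variant: tuple) -> float:
    """Ratio of variant MITF to baseline MITF; values above 1 favor the variant."""
    return mitf(*variant) / mitf(*base)


# Hypothetical design points: (IPC, frequency in Hz, MTTF in seconds).
baseline = (1.20, 2.0e9, 1.0e8)   # no squashing
squashing = (1.17, 2.0e9, 1.3e8)  # slightly lower IPC, longer time to failure

gain = mitf_gain(baseline, squashing)
print(f"MITF gain from squashing: {gain:.3f}x")
```

Under this assumed formulation, a design change is worthwhile on MITF grounds whenever the relative increase in MTTF exceeds the relative loss in IPC, since frequency cancels out of the ratio.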