Exploiting microarchitecture insights for efficient fault tolerance

  • Authors:
  • Eric Rotenberg; Vimal Kodandarama Reddy

  • Affiliations:
  • North Carolina State University; North Carolina State University

  • Venue:
  • Exploiting microarchitecture insights for efficient fault tolerance
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Abstract

Technology scaling makes transistors more susceptible to transient faults. As a result, it is becoming increasingly important to incorporate transient fault tolerance in future processors. Traditional transient fault tolerance approaches duplicate execution in time or space for robust fault tolerance, but are expensive in terms of performance, area, and power, counteracting the very benefits of technology scaling. To make fault tolerance viable for commodity processors, unconventional techniques are needed that provide significant fault protection in an efficient manner. In this spirit, this thesis presents two low-overhead approaches to fault tolerance based on microarchitecture insights. First, prediction-based partial redundant threading (PRT) is presented as a low-overhead alternative to full redundant multithreading (RMT). In RMT, two copies of a program are executed on a simultaneous multithreading (SMT) substrate. Outcomes of duplicated instructions are compared to detect transient faults in the processor. RMT incurs high performance and power overheads due to full redundant execution (slowdowns as high as 40%). In prediction-based PRT, confident predictions are leveraged as effective proxies for redundant execution, based on the idea that a correct prediction of an instruction's outcome is the same as the outcome produced by fault-free execution of the instruction. Confidently predicted instructions and their producers are skipped in the redundant thread (as many as 57% of instructions are skipped). This predictive thread is shown to be as effective as a full thread for checking purposes, but much more efficient. Second, a superscalar processor is designed with built-in checks that indirectly detect low-level transient faults, by observing microarchitecture-level anomalies they cause. A single check covers many logic blocks, similar in spirit to outcome checks in RMT, but without the overheads of redundant execution.
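As a rough illustration of the prediction-based PRT idea described above, the following Python sketch checks each main-thread outcome either against a confident prediction (skipping redundant execution) or against a redundantly recomputed value. It is a toy software model under stated assumptions, not the thesis's hardware design: the names `Instr`, `check_outcomes`, and all fields are hypothetical, and the skipping of producer instructions is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Instr:
    """Hypothetical record for one dynamic instruction (illustrative only)."""
    name: str
    redo: Callable[[], int]      # redundant re-execution of the instruction
    main_outcome: int            # outcome produced by the main thread
    prediction: Optional[int]    # predicted outcome, or None if unavailable
    confident: bool              # whether the predictor is confident


def check_outcomes(instrs: List[Instr]) -> Tuple[List[str], int]:
    """Return (names of mismatching instructions, count of skipped re-executions)."""
    mismatches: List[str] = []
    skipped = 0
    for ins in instrs:
        if ins.confident and ins.prediction is not None:
            # A confident prediction stands in for the redundant execution.
            reference = ins.prediction
            skipped += 1
        else:
            # No confident prediction: fall back to redundant execution.
            reference = ins.redo()
        if reference != ins.main_outcome:
            # Assumption: any mismatch is flagged. A real design would also
            # need to tell a misprediction apart from a fault; not modeled here.
            mismatches.append(ins.name)
    return mismatches, skipped
```

In this model, a mismatch on a confidently predicted instruction is reported the same way as a mismatch found by redundant execution; how mispredictions are separated from true faults is left out of the sketch.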
This dissertation develops several novel microarchitecture-level fault checks for protecting critical superscalar processor structures. Most notably, (1) inherent time redundancy (ITR) exploits program repetition to detect faults in decode signals, thereby covering the fetch and decode units, (2) register name authentication (RNA) asserts consistencies among renaming structures to detect faults affecting register renaming, and (3) timestamp-based assertion checking (TAC) asserts sequential order among dependent instructions to detect faults affecting dynamic instruction scheduling. Based on these checks, a fault-checking regimen is engaged to comprehensively protect a superscalar processor pipeline. To evaluate fault tolerance of the processor, a new fault injection strategy is developed. It involves analyzing the microarchitecture of a superscalar processor in depth and identifying high-level faults which can be modeled in a timing simulator, enabling a fast and reasonably accurate evaluation. Extensive fault injection experiments reveal that the new fault-checking regimen provides substantial fault coverage to the processor, making the case for a canonical fault-tolerant superscalar processor.
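The ordering assertion behind TAC can be pictured with a small Python sketch: each instruction carries an issue timestamp, and a dependent instruction must not issue before its producer's result is ready. This is only a software analogy under assumptions; the function name `tac_violations`, the `(issue_cycle, latency)` encoding, and the dependence map are hypothetical and are not taken from the dissertation.

```python
from typing import Dict, List, Tuple


def tac_violations(
    schedule: Dict[str, Tuple[int, int]],
    deps: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (producer, consumer) pairs whose issue order breaks the dependence.

    schedule maps an instruction name to (issue_cycle, latency).
    deps maps a consumer name to the names of its producers.
    """
    violations: List[Tuple[str, str]] = []
    for consumer, producers in deps.items():
        consumer_issue, _ = schedule[consumer]
        for producer in producers:
            producer_issue, producer_latency = schedule[producer]
            # Assumed rule: the consumer may issue only once the producer's
            # result is available, i.e. at issue time plus latency or later.
            if consumer_issue < producer_issue + producer_latency:
                violations.append((producer, consumer))
    return violations
```

For example, with `schedule = {"i1": (0, 1), "i2": (1, 3), "i3": (2, 1)}` and `deps = {"i2": ["i1"], "i3": ["i2"]}`, the pair `("i2", "i3")` is reported, because `i3` issues at cycle 2 while the result of `i2` is not ready until cycle 4. A scheduling fault that wakes an instruction too early would surface as such a violation.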