Detecting and recovering from in-core hardware faults through software anomaly treatment

  • Authors:
  • Sarita V. Adve;Pradeep Ramachandran

  • Affiliations:
  • University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign

  • Venue:
  • Detecting and recovering from in-core hardware faults through software anomaly treatment
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Aggressive scaling of CMOS transistors has enabled extensive system integration and building faster and more efficient systems. On the flip side, this has resulted in an increasing number of devices that fail in shipped components in-the-field for a variety of reasons including soft errors, wear-out failures, and infant mortality. The pervasiveness of the problem across a broad market demands low cost and generic reliability solutions, precluding traditional solutions that employed excessive redundancy or piecemeal solutions that address only a few failure modes. This dissertation presents SWAT (SoftWare Anomaly Treatment), a low cost resiliency solution that effectively handles hardware faults while incurring low cost during the common mode of fault-free operations. SWAT is based on two key observations about the design of resilient systems. First, only those hardware faults that affect software need to be handled and second, since the common mode of operation is fault-free, fault-free execution should incur near-zero overheads. SWAT thus uses novel zero to low cost hardware and software monitors that watch for anomalous software behavior to detect hardware faults. SWAT then relies on hardware support for checkpointing and rollback recovery. When dealing with fault recovery in the presence of I/O, we identify that existing software-level mechanisms that handle output buffering fall short. This dissertation therefore proposes a simple low-cost hardware buffer for output buffering and demonstrates that this strategy achieves high recoverability while incurring low overheads. Although not detailed in this dissertation, SWAT contains a comprehensive diagnosis procedure that is invoked in the rare event of a fault to isolate the root-cause of the fault by distinguishing between software bugs, transient hardware faults, and permanent hardware faults. Effectively, SWAT handles hardware faults uniformly as software bugs, amortizing the resiliency cost across both hardware and software reliability. The results in this dissertation show that the SWAT strategy is effective to detect and recover the system from a variety of in-core permanent and transient faults in various microarchitecture units for both compute-intensive and I/O-intensive workloads. In particular, this dissertation demonstrates that the SWAT detectors detect nearly all permanent and transient faults in most hardware units in both types of workloads, with only a small fraction of the faults corrupting application output.(Certain hardware structures like the FPU may need additional support to be amenable to software anomaly detection.) Further, a majority of these faults are tolerated by the applications due to their inherent fault-tolerant nature, resulting in only 0.2% of the injected faults affecting the application and yielding incorrect outputs (such faults are classified as Silent Data Corruptions, or SDCs). When attempting to recover the detected faults, we show that handling I/O is important for fault recovery. With our proposed low-cost hardware for output buffering, we show that over 94% of the detected faults are recoverable with low performance and area overheads during fault-free execution even in the presence of I/O. Finally, this dissertation builds a fundamental understanding behind why the SWAT strategy is effective for handling faults in modern workloads. The key insight is that the SWAT detectors are adept at detecting perturbations in control operations and memory addresses and a majority of the application values affect such operations. Faults in values that that never affect such operations are hard-to-detect and require additional support to be amenable to software anomaly detection. In summary, this dissertation presents SWAT as a complete solution to detect and recover from from in-core hardware faults. The techniques presented here therefore have far reaching implications on the design of low-cost solutions to handle unreliable hardware.