ReStore: Symptom-Based Soft Error Detection in Microprocessors

Authors:
Nicholas J. Wang;Sanjay J. Patel
Affiliations:
IEEE;IEEE
Venue:
IEEE Transactions on Dependable and Secure Computing
Year:
2006

Abstract

Device scaling and large-scale integration have led to growing concerns about soft errors in microprocessors. To date, in all but the most demanding applications, implementing parity and ECC for caches and other large, regular SRAM structures has been sufficient to stem the growing soft error tide. This will not be the case for long, and questions remain as to the best way to detect and recover from soft errors in the remainder of the processor, in particular the less structured execution core. In this work, we propose the ReStore architecture, which leverages existing performance-enhancing checkpointing hardware to recover from soft error events at low cost. Error detection in the ReStore architecture is novel: symptoms that hint at the presence of soft errors trigger restoration of a previous checkpoint. Example symptoms include exceptions, control flow misspeculations, and cache or translation look-aside buffer misses. Compared to conventional soft error detection via full replication, the ReStore framework incurs little overhead but sacrifices some error coverage. These attributes make it well suited to providing very cost-effective error coverage for processor applications that can tolerate a small but nonzero soft error failure rate. Our evaluation of an example ReStore implementation exhibits a 2x increase in MTBF (mean time between failures) over a standard pipeline with minimal hardware and performance overheads. The MTBF increases by 20x if ReStore is coupled with protection for certain particularly vulnerable pipeline structures.
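
The symptom-triggered rollback described in the abstract can be illustrated with a small software sketch. This is not the authors' hardware design: the `ToyCore` class, the `Symptom` exception, the checkpoint interval, and the "operand out of range" check are all hypothetical stand-ins chosen for illustration. The sketch assumes the program itself contains only valid operands, so any symptom is treated as a possible soft error and simply triggers a rollback and re-execution.

```python
from dataclasses import dataclass


class Symptom(Exception):
    """Hypothetical stand-in for a symptom such as an exception."""


@dataclass
class ToyCore:
    checkpoint_interval: int = 4   # assumed: instructions between checkpoints
    acc: int = 0                   # toy architectural state: one accumulator
    pc: int = 0                    # index of the next instruction
    rollbacks: int = 0             # number of checkpoint restorations
    _saved: tuple = (0, 0)         # last checkpoint as (pc, acc)

    def take_checkpoint(self):
        self._saved = (self.pc, self.acc)

    def restore_checkpoint(self):
        self.pc, self.acc = self._saved
        self.rollbacks += 1

    def step(self, operand, flip_bit=None):
        value = operand
        if flip_bit is not None:
            value ^= 1 << flip_bit          # transient single-bit upset
        if not 0 <= value < 256:            # assumed symptom: illegal operand
            raise Symptom(f"illegal operand {value} at pc={self.pc}")
        self.acc += value
        self.pc += 1


def run(program, fault_at=None, flip_bit=None):
    """Execute `program`, optionally injecting one transient bit flip."""
    core = ToyCore()
    fault_pending = fault_at is not None
    while core.pc < len(program):
        if core.pc % core.checkpoint_interval == 0:
            core.take_checkpoint()
        inject = fault_pending and core.pc == fault_at
        if inject:
            fault_pending = False           # transient: it happens only once
        try:
            core.step(program[core.pc], flip_bit if inject else None)
        except Symptom:
            core.restore_checkpoint()       # symptom seen: roll back and retry
    return core


program = [1, 2, 3, 4, 5, 6, 7, 8]
clean = run(program)
detected = run(program, fault_at=5, flip_bit=10)  # flip causes a symptom
silent = run(program, fault_at=5, flip_bit=0)     # flip causes no symptom
print(clean.acc, detected.acc, detected.rollbacks, silent.acc)
```

In this toy model, the high-order bit flip produces an illegal operand, which triggers a rollback, and re-execution yields the fault-free result. The low-order flip produces no symptom and silently corrupts the sum, which mirrors the trade-off the abstract names: little overhead in exchange for less than full error coverage.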