ReStore: Symptom-Based Soft Error Detection in Microprocessors
IEEE Transactions on Dependable and Secure Computing
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
SlicK: slice-based locality exploitation for efficient redundant multithreading
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection
Proceedings of the International Symposium on Code Generation and Optimization
Transient fault prediction based on anomalies in processor events
Proceedings of the conference on Design, automation and test in Europe
Techniques for Efficient Software Checking
Languages and Compilers for Parallel Computing
Exploiting selective placement for low-cost memory protection
ACM Transactions on Architecture and Code Optimization (TACO)
ESoftCheck: Removal of Non-vital Checks for Fault Tolerance
Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
Selective replication: A lightweight technique for soft errors
ACM Transactions on Computer Systems (TOCS)
Shoestring: probabilistic soft error reliability on the cheap
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Design techniques for cross-layer resilience
Proceedings of the Conference on Design, Automation and Test in Europe
RISE: improving the streaming processors reliability against soft errors in gpgpus
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Cost-effective soft-error protection for SRAM-based structures in GPGPUs
Proceedings of the ACM International Conference on Computing Frontiers
Functional test-sequence grading at register-transfer level
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
On the Impact of Performance Faults in Modern Microprocessors
Journal of Electronic Testing: Theory and Applications
Epipe: A low-cost fault-tolerance technique considering WCET constraints
Journal of Systems Architecture: the EUROMICRO Journal
Hi-index | 0.00 |
Device scaling and large scale integration have led to growing concerns about soft errors in microprocessors. To date, in all but the most demanding applications, implementing parity and ECC for caches and other large, regular SRAM structures have been sufficient to stem the growing soft error tide. This will not be the case for long, and questions remain as to the best way to detect and recover from soft errors in the remainder of the processor 驴 in particular, the less structured execution core. In this work, we propose the ReStore architecture, which leverages existing performance enhancing checkpointing hardware to recover from soft error events in a low cost fashion. Error detection in the ReStore architecture is novel: symptoms that hint at the presence of soft errors trigger restoration of a previous checkpoint. Example symptoms include exceptions, control flow mis-speculations, and cache or translation look-aside buffer misses. Compared to conventional soft error detection via full replication, the ReStore framework incurs little overhead, but sacrifices some amount of error coverage. These attributes make it an ideal means to provide very cost effective error coverage for processor applications that can tolerate a non-zero, but small, soft error failure rate. Our evaluation of an example ReStore implementation exhibits a 2x increase in MTBF (mean time between failures) over a standard pipeline with minimal hardware and performance overheads. The MTBF increases by 7x if ReStore is coupled with parity protection for certain pipeline structures.