IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Transient-fault recovery using simultaneous multithreading
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dual use of superscalar datapath for transient-fault detection and recovery
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A Fault Tolerant Approach to Microprocessor Design
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Transient-fault recovery for chip multiprocessors
Proceedings of the 30th annual international symposium on Computer architecture
Minos: Control Data Attack Prevention Orthogonal to Memory Model
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Opportunistic Transient-Fault Detection
Proceedings of the 32nd annual international symposium on Computer Architecture
Software-controlled fault tolerance
ACM Transactions on Architecture and Code Optimization (TACO)
Self-checking instructions: reducing instruction redundancy for concurrent error detection
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Application-Level Correctness and its Impact on Fault Tolerance
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Protective redundancy overhead reduction using instruction vulnerability factor
Proceedings of the 7th ACM international conference on Computing frontiers
Using hardware vulnerability factors to enhance AVF analysis
Proceedings of the 37th annual international symposium on Computer architecture
Assuring application-level correctness against soft errors
Proceedings of the International Conference on Computer-Aided Design
Leveraging variable function resilience for selective software reliability on unreliable hardware
Proceedings of the Conference on Design, Automation and Test in Europe
Exploiting program-level masking and error propagation for constrained reliability optimization
Proceedings of the 50th Annual Design Automation Conference
Improving the fault resilience of an H.264 decoder using static analysis methods
ACM Transactions on Embedded Computing Systems (TECS) - Special Section on ESTIMedia'10
Journal of Electronic Testing: Theory and Applications
As voltages decrease, soft errors are expected to become an increasing threat to program correctness. Unfortunately, previous mechanisms for improving processor reliability protect all instructions equally, and so suffer from significant performance degradation, substantial hardware overhead, or both. However, recent research has shown that in multimedia applications such as photography, video, and audio, not all instructions are created equal: many operations are far more tolerant of faults than others [1]. This observation can be leveraged to limit the cost of reliable computing by protecting only those instructions that are critical to correct execution. We propose a mechanism that protects against soft errors through selective instruction replication. We begin with a dynamic instruction replication framework that replicates every instruction and checks the results at commit, rolling back on any mismatch. Instead of replicating the entire program, our mechanism leaves unprotected those instructions that the compiler identifies as tolerant of error. While full replication incurs 40% to 100% overhead, our mechanism incurs only 30% to 75%, a reduction of 15-33%, and requires minimal additional hardware. This approach costs only 0.5% to 1% in fidelity.
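The replicate, check-at-commit, and roll-back flow described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism the authors evaluate: the `Instr` class, its `critical` flag (standing in for the compiler's tolerance tag), the `execute` function, the toy `PROGRAM`, and the single-bit fault model are all hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Instr:
    """One toy instruction. `critical` is a hypothetical compiler-provided tag."""
    name: str
    op: Callable[[int], int]
    critical: bool


def execute(program: List[Instr], x: int,
            fault_at: Optional[int] = None) -> Tuple[int, int, int]:
    """Run `program` on `x` with selective replication.

    Returns (result, replicas_executed, rollbacks). A transient fault
    (assumed model: one flipped low bit, occurring once) is injected
    into the primary copy of instruction `fault_at`, if given.
    """
    replicas = 0
    rollbacks = 0
    for i, ins in enumerate(program):
        inject = (i == fault_at)
        while True:
            primary = ins.op(x)
            if inject:
                primary ^= 1      # transient fault: flip the low bit
                inject = False    # the fault does not recur on re-execution
            if not ins.critical:
                break             # tolerant instruction: commit unchecked
            replicas += 1
            shadow = ins.op(x)    # redundant copy of a critical instruction
            if primary == shadow:
                break             # copies agree at commit
            rollbacks += 1        # mismatch: discard the result and re-execute
        x = primary               # commit
    return x, replicas, rollbacks


# Toy program: two instructions tagged critical, one tagged tolerant.
PROGRAM = [
    Instr("scale", lambda v: v * 4, critical=True),
    Instr("offset", lambda v: v + 3, critical=True),
    Instr("brighten", lambda v: v + 2, critical=False),
]

if __name__ == "__main__":
    print(execute(PROGRAM, 5))               # fault-free run
    print(execute(PROGRAM, 5, fault_at=0))   # fault in a protected instruction
    print(execute(PROGRAM, 5, fault_at=2))   # fault in an unprotected instruction
```

In this toy model, a fault in a protected instruction is caught at commit and repaired at the cost of one re-execution, while a fault in an unprotected instruction passes through and slightly perturbs the output. That is the trade between replication overhead and fidelity that the abstract quantifies; the numbers the model produces are illustrative only and say nothing about the reported overheads.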