A transient hardware fault occurs when an energetic particle strikes a transistor, causing it to change state. Although transient faults do not permanently damage the hardware, they may corrupt computations by altering stored values and signal transfers. In this paper, we propose a new scheme for provably safe and reliable computing in the presence of transient hardware faults. In our scheme, software computations are replicated to provide redundancy, while special instructions compare the independently computed results to detect errors before writing critical data. In stark contrast to previous efforts in this area, we have analyzed our fault tolerance scheme from a formal, theoretical perspective. Specifically, we first provide an operational semantics for our assembly language, which includes a precise formal definition of our fault model. Second, we develop an assembly-level type system designed to detect reliability problems in compiled code. Third, we provide a formal specification for program fault tolerance under the given fault model and prove that all well-typed programs are indeed fault tolerant. In addition to the formal analysis, we evaluate our detection scheme and show that the fault-tolerant code takes only 34% longer to execute than the unreliable version.
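The replicate-and-compare idea described above can be illustrated with a minimal sketch. This is not the paper's assembly language or type system. It is a high-level Python analogy, and every name in it (`FaultDetected`, `checked_store`, `scale_and_store`, the `fault` parameter) is hypothetical and introduced only for illustration. It assumes a single bit flip in one of the two redundant copies and shows the comparison blocking the write of a corrupted value.

```python
class FaultDetected(Exception):
    """Raised when the two redundant copies of an address or value disagree."""


def checked_store(memory, addr_a, val_a, addr_b, val_b):
    # Hypothetical stand-in for a "compare before write" instruction:
    # the store happens only if both redundant copies agree.
    if addr_a != addr_b or val_a != val_b:
        raise FaultDetected("redundant copies disagree; store suppressed")
    memory[addr_a] = val_a


def scale_and_store(memory, addr, x, fault=None):
    # Two independently computed copies of the same result.
    a = x * 2 + 1
    b = x * 2 + 1
    if fault is not None:
        # Simulated transient fault (an assumption of this sketch):
        # XOR flips the bits selected by `fault` in one copy only.
        a ^= fault
    checked_store(memory, addr, a, addr, b)
```

Calling `scale_and_store(mem, 0, 20)` on an empty dictionary `mem` stores the value at address 0. Passing a nonzero `fault` mask makes the copies diverge, so `checked_store` raises `FaultDetected` and leaves `mem` unchanged. The sketch only detects the error; it makes no attempt at recovery.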