Fault-tolerant computer system design
Fault-tolerant computer system design
Design and Evaluation of System-Level Checks for On-Line Control Flow Error Detection
IEEE Transactions on Parallel and Distributed Systems
Reliability Issues in Computing System Design
ACM Computing Surveys (CSUR)
Scheduling Policies for Fault Tolerance in a VLSI Processor
Proceedings of the The IEEE International Workshop on Defect and Fault Tolerance in VLSI Systems
Embryonics + Immunotronics: A Bio-Inspired Approach to Fault Tolerance
EH '00 Proceedings of the 2nd NASA/DoD workshop on Evolvable Hardware
Experimental evaluation of the fail-silent behaviour in programs with consistency checks
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Hi-index | 0.00 |
Based on semantics of an application processing logic, we find out the most critical and sensitive parts of an application and we derive set of conditions or assertions among the various diagnostic checkpoint variables and we enhance the processing logic to enable it to detect run-time various operational or environmental faults toward fault tolerant computing. This paper examines how a single-version algorithm can establish software based fault tolerance by designing in thoughtful software based execution-time checks in a computing application. The algorithm developed here relies on various assertions that are derived from the semantics of an application. Various diagnostic assertive checkpoints have been derived based on an application's semantics. This work is not intended to correct bit-errors using conventional error correction codes. Errors have been detected through checkpoints and periodical execution of an application with known test data and verification of observed result with known result thereof. Electrical transients or small particles hitting the circuit, often cause random errors or faults in data and program flow. The manuscript describes an algorithm that allows the detection and recovery of transient or operational failures in software on a specific problem, just by using one version of a software program running on just one machine. This approach does not aim to tolerate software design bugs. This algorithmic approach uses various run-time signatures and validation thereof in order to detect faults.