Application semantic driven assertions toward fault tolerant computing

Authors:
Goutam Kumar Saha
Affiliations:
Scientist-F, Centre for Development of Advanced Computing, Kolkata, West Bengal, India
Venue:
Ubiquity
Year:
2006

Abstract

Using the semantics of an application's processing logic, we identify its most critical and sensitive parts, derive a set of conditions (assertions) over diagnostic checkpoint variables, and extend the processing logic so that it can detect operational or environmental faults at run time. This paper examines how a single-version algorithm can provide software-based fault tolerance by building carefully designed execution-time checks into a computing application. The algorithm relies on assertions, placed at diagnostic checkpoints, that are derived from the application's semantics. The work is not intended to correct bit errors with conventional error-correcting codes. Instead, errors are detected through these checkpoints and through periodic execution of the application on known test data, with the observed result verified against the known result. Electrical transients or small particles striking the circuit often cause random errors in data and program flow. The manuscript describes an algorithm that detects and recovers from such transient or operational failures for a specific problem, using only one version of a software program running on a single machine. The approach does not aim to tolerate software design bugs. It detects faults by using various run-time signatures and validating them.
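
The ideas in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm, which the abstract does not spell out. The funds-transfer application, every function name, the fault injector, and the choice of re-execution as the recovery step are all assumptions made for illustration. The sketch shows two kinds of check: assertions derived from what the operation means (funds are conserved, no balance goes negative) and a run on known test data whose result is compared with the known answer.

```python
class FaultDetected(Exception):
    """Raised when an application-level check fails (hypothetical name)."""


def transfer(balances, src, dst, amount):
    """Application logic: move `amount` from src to dst, returning new balances."""
    new = dict(balances)
    new[src] -= amount
    new[dst] += amount
    return new


def check_assertions(before, after):
    """Assertions derived from what a transfer means, not from bit-level codes."""
    if sum(after.values()) != sum(before.values()):
        raise FaultDetected("total funds not conserved")
    if any(value < 0 for value in after.values()):
        raise FaultDetected("negative balance")


def self_test(operation):
    """Run the operation on known test data and compare with the known result."""
    known_input = {"a": 100, "b": 50}
    known_output = {"a": 70, "b": 80}
    if operation(known_input, "a", "b", 30) != known_output:
        raise FaultDetected("known-answer test failed")


def guarded_transfer(balances, src, dst, amount, operation=transfer, retries=2):
    """Run a single version of the logic, check it, and re-execute on a fault.

    Re-execution is an assumed recovery strategy for transient faults.
    """
    for _ in range(retries + 1):
        try:
            self_test(operation)
            result = operation(balances, src, dst, amount)
            check_assertions(balances, result)
            return result
        except FaultDetected:
            continue  # the input state is untouched, so re-execution is safe
    raise FaultDetected("fault persisted after retries")


def make_flaky(fail_times):
    """Hypothetical fault injector: corrupts the first `fail_times` results."""
    state = {"left": fail_times}

    def flaky(balances, src, dst, amount):
        result = transfer(balances, src, dst, amount)
        if state["left"] > 0:
            state["left"] -= 1
            result[dst] += 1  # simulate a transient data error
        return result

    return flaky


if __name__ == "__main__":
    accounts = {"a": 100, "b": 50}
    print(guarded_transfer(accounts, "a", "b", 30, operation=make_flaky(1)))
```

In this sketch a fault that corrupts one run is caught by the checks and masked by re-execution, while a fault that persists beyond the retry budget is reported rather than hidden. The checks cannot catch a design bug that violates none of the assertions and also appears in the known-answer data, which matches the abstract's statement that design bugs are out of scope.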