Reducing Critical Failures for Control Algorithms Using Executable Assertions and Best Effort Recovery

Authors:
Jonny Vinter; Joakim Aidemark; Peter Folkesson; Johan Karlsson
Venue:
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Year:
2001

Citing 5

The MAFT Architecture for Distributed Fault Tolerance. IEEE Transactions on Computers - Fault-Tolerant Computing
Understanding fault-tolerant distributed systems. Communications of the ACM
GOOFI: Generic Object-Oriented Fault Injection Tool. DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
A Comparison of Simulation Based and Scan Chain Implemented Fault Injection. FTCS '98 Proceedings of the Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing
Byzantine clock synchronization. PODC '84 Proceedings of the third annual ACM symposium on Principles of distributed computing

Cited 5

Reset-Driven Fault Tolerance. EDCC-4 Proceedings of the 4th European Dependable Computing Conference on Dependable Computing
GOOFI: Generic Object-Oriented Fault Injection Tool. DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Experiences with the design of a run-time check. SAFECOMP'06 Proceedings of the 25th international conference on Computer Safety, Reliability, and Security
On the effectiveness of run-time checks. SAFECOMP'05 Proceedings of the 24th international conference on Computer Safety, Reliability, and Security
Failure boundedness in discrete applications. LADC'07 Proceedings of the Third Latin-American conference on Dependable Computing

Abstract

Systems that use f+1 computer nodes to tolerate f node failures ordinarily require the nodes to have strong failure semantics, i.e., a node should either produce correct results or no results at all. We show that this requirement can be relaxed for control applications, because control algorithms inherently compensate for a class of value failures. A value failure occurs when an error escapes the error detection mechanisms in the computer node and an erroneous value is sent to the actuators of the control system. Fault injection experiments show that 89% of the value failures caused by bit-flips in a CPU had no or only minor impact on the controlled object. However, the experiments also show that 11% of the value failures had severe consequences. These failures were caused by bit-flips affecting the state variables of the control algorithm. Another set of fault injection experiments shows that the percentage of value failures with severe consequences was reduced to 3% when the state variables were protected with executable assertions and best effort recovery mechanisms.
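The idea of protecting a state variable with an executable assertion and best effort recovery can be illustrated with a minimal sketch. The abstract does not say which control algorithm or which assertions the authors used, so everything below is an assumption made for illustration: a PI controller stands in for the control algorithm, a range check on its integrator stands in for the executable assertion, and restoring the last accepted value stands in for best effort recovery. The names `GuardedPIController` and `flip_bit` and all numeric bounds are hypothetical.

```python
import math
import struct


class GuardedPIController:
    """Illustrative PI controller whose integrator state is checked each step."""

    def __init__(self, kp, ki, dt, i_min, i_max):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.i_min, self.i_max = i_min, i_max  # assumed valid range of the state
        self.integral = 0.0                    # state variable to protect
        self._backup = 0.0                     # last state that passed the check

    def _state_ok(self):
        # Executable assertion (assumed form): the state must be finite
        # and lie inside its expected range.
        return math.isfinite(self.integral) and self.i_min <= self.integral <= self.i_max

    def step(self, setpoint, measurement):
        if self._state_ok():
            self._backup = self.integral
        else:
            # Best effort recovery (assumed form): fall back to the last
            # accepted state instead of halting the node.
            self.integral = self._backup
        error = setpoint - measurement
        self.integral += self.ki * error * self.dt
        # Clamp so that a fault-free update never violates the assertion.
        self.integral = min(max(self.integral, self.i_min), self.i_max)
        return self.kp * error + self.integral


def flip_bit(value, bit):
    """Flip one bit (0..63) in the IEEE 754 double representation of value."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


if __name__ == "__main__":
    ctrl = GuardedPIController(kp=1.0, ki=0.5, dt=0.1, i_min=-10.0, i_max=10.0)
    for _ in range(3):
        ctrl.step(setpoint=1.0, measurement=0.0)
    ctrl.integral = flip_bit(ctrl.integral, 62)  # emulate a bit-flip in the state
    print("corrupted state:", ctrl.integral)
    print("output after recovery:", ctrl.step(setpoint=1.0, measurement=0.0))
```

In this sketch the recovery costs one step of integration, since the restored value is one iteration old. A range check of this kind catches only flips that push the state outside its bounds; a flip that leaves the value in range passes the check undetected.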