Abstract: Systems that use f+1 computer nodes to tolerate f node failures ordinarily require the nodes to have strong failure semantics, i.e., a node should either produce correct results or no results at all. We show that this requirement can be relaxed for control applications, as control algorithms inherently compensate for a class of value failures. A value failure occurs when an error escapes the error detection mechanisms in the computer node and an erroneous value is sent to the actuators of the control system. Fault injection experiments show that 89% of the value failures caused by bit-flips in a CPU had no impact or only a minor impact on the controlled object. However, the experiments also show that 11% of the value failures had severe consequences. These failures were caused by bit-flips affecting the state variables of the control algorithm. Another set of fault injection experiments shows that the percentage of value failures with severe consequences was reduced to 3% when the state variables were protected with executable assertions and best effort recovery mechanisms.
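To make the protection mechanism concrete, the following is a minimal sketch of an executable assertion with best effort recovery guarding the state variable of a control algorithm. It is not the authors' implementation: the PI controller, the class name `PIController`, the range check derived from the actuator limits, and the choice to recover by restoring the last value that passed the check are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Hypothetical PI controller whose integrator state is guarded."""
    kp: float
    ki: float          # assumed > 0 so the state bounds below are well defined
    dt: float
    out_min: float
    out_max: float
    integral: float = 0.0          # state variable of the control algorithm
    _integral_backup: float = 0.0  # last state value that passed the assertion

    def _integral_bounds(self):
        # Assumption: a valid integrator state never needs to exceed what the
        # actuator range can use, so the actuator limits bound the state.
        return self.out_min / self.ki, self.out_max / self.ki

    def step(self, setpoint, measurement):
        lo, hi = self._integral_bounds()

        # Executable assertion: the state must lie in its plausible range.
        # A NaN also fails this check, because comparisons with NaN are False.
        if not (lo <= self.integral <= hi):
            # Best effort recovery: fall back to the last accepted state.
            # This value may be slightly stale, but the control loop can
            # compensate for a small deviation in later iterations.
            self.integral = self._integral_backup

        error = setpoint - measurement
        self.integral = min(max(self.integral + error * self.dt, lo), hi)
        self._integral_backup = self.integral

        u = self.kp * error + self.ki * self.integral
        return min(max(u, self.out_min), self.out_max)


# Usage: simulate a bit-flip that corrupts the state between two iterations.
ctrl = PIController(kp=1.0, ki=2.0, dt=0.1, out_min=0.0, out_max=10.0)
u1 = ctrl.step(setpoint=1.0, measurement=0.0)
ctrl.integral = 1e30   # injected value failure in the state variable
u2 = ctrl.step(setpoint=1.0, measurement=0.0)
```

In this sketch the corrupted state is caught before it can influence the actuator command, so the second output continues from the last accepted state instead of saturating. A range check of this kind only detects corruptions that push the state outside its plausible range, so an error that leaves the value within bounds would pass undetected.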