Fault-Injection-Based Testing Of Fault-Tolerant Algorithms In Message-Passing Parallel Computers

Authors:
Douglas M. Blough;Tatsuhiro Torii
Affiliations:
Univ. of California, Irvine, CA;Univ. of California, Irvine, CA
Venue:
Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97)
Year:
1997

Abstract

Distributed-memory parallel computers offer inherent redundancy that can be exploited to provide software-implemented fault tolerance. Numerous algorithms have been developed for fault-tolerant unicast communication, fault-tolerant broadcast, fault diagnosis, checkpoint/rollback, various consensus problems, and algorithm-based fault tolerance. Correctness proofs for these algorithms tend to be quite complex and, as a result, are error-prone. Furthermore, the way in which an algorithm is implemented can have a dramatic impact on its correctness. Fault-injection-based testing is therefore an essential component of the validation procedure for these algorithms, and it can complement other methods such as formal verification. The authors present a methodology for fault injection in distributed-memory parallel computers that use a message-passing paradigm. Their approach is based on injecting faults into interprocessor communications, and it allows emulation of the fault models commonly used in the design of fault-tolerant parallel algorithms. The methodology has been applied in a tool for fault injection on Intel iPSC/860 multicomputers and has been demonstrated through extensive testing of a fault-tolerant broadcast algorithm.
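
The idea of injecting faults into interprocessor communications can be illustrated with a small simulation. The sketch below is not the authors' tool and does not use the iPSC/860 message-passing interface; `FaultyChannel`, `naive_broadcast`, and all parameters are hypothetical names chosen for illustration. It assumes three simple fault models (fail-stop nodes, message omission, and payload corruption) and a deliberately non-fault-tolerant broadcast as the algorithm under test.

```python
import random
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class FaultyChannel:
    """Hypothetical send wrapper that injects faults between simulated nodes."""
    drop_prob: float = 0.0               # omission fault: message is lost
    corrupt_prob: float = 0.0            # value fault: one payload byte is flipped
    crashed: frozenset = frozenset()     # fail-stop nodes neither send nor receive
    seed: int = 0                        # fixed seed keeps test runs repeatable

    def __post_init__(self) -> None:
        self.rng = random.Random(self.seed)

    def send(self, src: int, dst: int, payload: bytes) -> Optional[bytes]:
        """Return what dst actually receives, or None if nothing arrives."""
        if src in self.crashed or dst in self.crashed:
            return None
        if self.rng.random() < self.drop_prob:
            return None
        if payload and self.rng.random() < self.corrupt_prob:
            i = self.rng.randrange(len(payload))
            # XOR with 0xFF always changes the selected byte.
            payload = payload[:i] + bytes([payload[i] ^ 0xFF]) + payload[i + 1:]
        return payload


def naive_broadcast(channel: FaultyChannel, root: int, n: int,
                    payload: bytes) -> Set[int]:
    """Root sends directly to every other node (no fault tolerance).

    Returns the set of nodes that hold the exact original payload.
    """
    received = {root}
    for dst in range(n):
        if dst != root and channel.send(root, dst, payload) == payload:
            received.add(dst)
    return received


if __name__ == "__main__":
    for label, channel in [
        ("fault-free", FaultyChannel()),
        ("node 3 crashed", FaultyChannel(crashed=frozenset({3}))),
        ("all messages dropped", FaultyChannel(drop_prob=1.0)),
    ]:
        print(label, sorted(naive_broadcast(channel, 0, 8, b"hello")))
```

Because the faults are injected at the communication layer, the algorithm under test needs no modification. A fault-tolerant broadcast could be substituted for `naive_broadcast` and checked against the same injected fault scenarios.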