RAFT is a recursive algorithm for fault tolerance that combines dynamic space redundancy and time redundancy to detect faulty processors and recover from errors. U* is a multicomputer testbed consisting of a network of AT&T 3B2 computers running a network operating system based on the UNIX system. This paper describes a software implementation of RAFT on U* and demonstrates the effectiveness of a RAFT-like scheme for designing fault-tolerant multicomputer systems. It presents the results of Monte Carlo experiments conducted on this system, which validated the theoretical basis of RAFT, together with the experimentally observed performance penalty incurred by fault tolerance.