Software implementation of a recursive fault tolerance algorithm on a network of computers

  • Authors:
  • P. Agrawal; R. Agrawal

  • Affiliations:
  • AT&T Bell Laboratories, Murray Hill, New Jersey; -

  • Venue:
  • ISCA '86: Proceedings of the 13th Annual International Symposium on Computer Architecture
  • Year:
  • 1986

Abstract

RAFT is a recursive algorithm for fault tolerance that uses a combination of dynamic space and time redundancy techniques to detect faulty processors and recover from errors. U* is a multicomputer testbed consisting of a network of AT&T 3B2 computers running a network operating system based on the UNIX system. This paper describes a software implementation of RAFT on U* and demonstrates the effectiveness of a RAFT-like scheme for designing fault-tolerant multicomputer systems. It presents the results of Monte Carlo experiments conducted on this system, which validated the theoretical basis of RAFT, along with the experimentally observed performance penalty incurred by fault tolerance.