Distributed fault-tolerance for large multiprocessor systems

  • Authors:
  • J. G. Kuhl; S. M. Reddy

  • Venue:
  • ISCA '80: Proceedings of the 7th Annual Symposium on Computer Architecture
  • Year:
  • 1980

Abstract

Techniques for dealing with hardware failures in very large networks of distributed processing elements are presented. A concept known as distributed fault-tolerance is introduced. A model of a large multiprocessor system is developed, and techniques based on this model are given by which each processing element can correctly diagnose failures in all other processing elements in the system. The effect of varying system interconnection structures upon the extent and efficiency of the diagnosis process is discussed and illustrated with an example of an actual system. Finally, extensions that render the model more realistic are given, and a modified version of the diagnosis procedure that operates under the extended model is presented.