An approach to fault tolerance and error recovery in a parallel graph reduction machine: MaRS—a case study

  • Authors:
  • Alessandro Contessa

  • Affiliations:
  • Centre d'Etudes et de Recherches de Toulouse (CERT), P.O. Box 4025, 31055 Toulouse Cedex, France

  • Venue:
  • ACM SIGARCH Computer Architecture News
  • Year:
  • 1988

Abstract

This article deals with fault tolerance and error recovery in a parallel graph reduction computer such as the "MaRS" machine presently under development at CERT. MaRS is a multiprocessor system with decentralized control and asynchronous, delayed communications between cooperating, tightly coupled processes. A solution to the problem of MaRS error recovery is derived, based on the machine's execution model (successive reductions performed on the program graph, i.e. evaluations of the functional expression to be computed) and on its architectural organization (a number of reduction units and memory units interconnected by a message switching network). Under the basic assumption that the errors generated by faults in Reduction and Communication Processors can be detected and confined so as to avoid system contamination, it is shown that a coherent and error-free recovery state can be restored. Although specifically developed for the MaRS machine, this solution is in principle applicable to other machines using the graph reduction model.