This article addresses fault tolerance and error recovery in a parallel graph reduction computer such as the "MaRS" machine, presently under development at CERT. MaRS is a multiprocessor system with decentralized control and asynchronous, delayed communications between cooperating, tightly coupled processes. A solution to the MaRS error recovery problem is derived from the machine's execution model (successive reductions performed on the program graph, i.e. evaluations of the functional expression to be computed) and from its architectural organization (a number of reduction units and memory units interconnected by a message-switching network). Under the basic assumption that errors generated by faults in the Reduction and Communication Processors can be detected and confined so as to avoid system contamination, it is shown that a coherent and error-free recovery state can be restored. Although developed specifically for the MaRS machine, this solution is in principle applicable to other machines using the graph reduction model.
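The abstract does not spell out the recovery scheme itself, so the following is only an illustrative sketch of why a graph reduction model lends itself to recovery, not the MaRS mechanism. It assumes a toy expression graph in which each reduction is a pure rewrite, so a detected fault can be handled by returning to the last consistent graph and reducing again. All names here (`App`, `FaultDetected`, `reduce_once`, `evaluate`, `faulty_steps`) are hypothetical, and the single sequential loop stands in for what would be many reduction units in a real machine.

```python
from dataclasses import dataclass

# Hypothetical program graph: a node is an int literal or a binary application.
@dataclass(frozen=True)
class App:
    op: str
    left: object
    right: object

OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

class FaultDetected(Exception):
    """Stands in for an error that was detected and confined in a reduction unit."""

def reduce_once(expr, inject_fault=False):
    """Rewrite one innermost redex and return a new graph (the input is never mutated)."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr.left, int) and isinstance(expr.right, int):
        if inject_fault:
            raise FaultDetected("simulated reduction unit failure")
        return OPS[expr.op](expr.left, expr.right)
    if not isinstance(expr.left, int):
        return App(expr.op, reduce_once(expr.left, inject_fault), expr.right)
    return App(expr.op, expr.left, reduce_once(expr.right, inject_fault))

def evaluate(expr, faulty_steps=frozenset()):
    """Reduce to a value, restoring the last consistent graph whenever a fault is detected."""
    checkpoint = expr      # nodes are immutable, so a reference is a valid snapshot
    step = 0
    recoveries = 0
    while not isinstance(expr, int):
        try:
            expr = reduce_once(expr, inject_fault=step in faulty_steps)
            checkpoint = expr          # the new graph is consistent: save it
        except FaultDetected:
            expr = checkpoint          # roll back and retry on the next iteration
            recoveries += 1
        step += 1
    return expr, recoveries

graph = App("+", App("*", 2, 3), App("+", 4, 5))   # (2 * 3) + (4 + 5)
value, recoveries = evaluate(graph, faulty_steps={1})
```

Because each rewrite depends only on the graph it is applied to, repeating a failed reduction from the saved graph gives the same final value as a fault-free run. The sketch relies on the same assumption the abstract states: the fault is detected and confined before it can corrupt the saved state.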