Distributed Recovery in Fault-Tolerant Multiprocessor Networks

Authors:
R M Yanney;J P Hayes
Affiliations:
-;-
Venue:
IEEE Transactions on Computers
Year:
1986

Abstract

A methodology for characterizing dynamic distributed recovery in fault-tolerant multiprocessor systems is developed using graph theory. Distributed recovery, which is intended for systems with no central supervisor, depends on the cooperation of a set of processors to execute the recovery function, since each processor is assumed to have only a limited amount of information about the system as a whole. Facility graphs, whose nodes denote the system components (processors) and whose edges denote interconnections between components, are used to represent both multiprocessor systems and error conditions. A general distributed recovery strategy R, which allows global recovery to be achieved via a sequence of local actions, is given. R recovers the system in several steps, in which different nodes successively act as the local supervisor. R is then specialized for two important classes of systems: loop networks and tree networks. For each class, fault-tolerant designs and their associated distributed recovery strategies, which allow recovery from up to k faults within a specified number of steps, are presented.
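To make the idea of global recovery through a sequence of local actions concrete, the following is a minimal toy sketch in Python. It is not the strategy R from the paper. It assumes a plain loop network with spare nodes and a hypothetical role-shifting rule: the role displaced by a fault is handed to the clockwise neighbor, which acts as the local supervisor for that step and hands on its own former role, until a spare absorbs the displaced role. The names `ring_facility_graph`, `recover`, and `FAULTY` are illustrative assumptions, and the sketch does not check that the logical topology is preserved.

```python
from typing import Dict, List, Optional, Set

FAULTY = "FAULTY"  # hypothetical marker for a failed node


def ring_facility_graph(n: int) -> Dict[int, Set[int]]:
    """Facility graph of an n-node loop: node i is linked to i-1 and i+1 (mod n)."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def recover(graph: Dict[int, Set[int]], roles: List[Optional[str]], failed: int) -> int:
    """Shift the displaced role clockwise until a spare absorbs it.

    roles[i] is the role held by node i, None for a spare, FAULTY for a
    failed node. The list is updated in place. Returns the number of local
    steps taken; raises RuntimeError if a faulty node is reached first.
    """
    n = len(roles)
    displaced = roles[failed]
    roles[failed] = FAULTY
    steps = 0
    node = failed
    while displaced is not None:
        successor = (node + 1) % n
        # Locality assumption: a role is only ever handed to a direct neighbor.
        if successor not in graph[node]:
            raise RuntimeError("successor is not a neighbor in the facility graph")
        if roles[successor] == FAULTY:
            raise RuntimeError("no spare reachable before hitting a faulty node")
        # Local action: the successor takes the displaced role and passes on
        # whatever it held before (None if it was a spare, which ends the loop).
        roles[successor], displaced = displaced, roles[successor]
        node = successor
        steps += 1
    return steps


if __name__ == "__main__":
    graph = ring_facility_graph(4)
    roles: List[Optional[str]] = ["a", "b", "c", None]  # node 3 is the spare
    print(recover(graph, roles, failed=1), roles)
```

In this toy model the number of steps equals the clockwise distance from the faulty node to the nearest spare, which is one simple way to see why spare placement bounds recovery time.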