A Diagnosis Algorithm for Distributed Computing Systems with Dynamic Failure and Repair

Authors:
S. H. Hosseini;J. G. Kuhl;S. M. Reddy
Affiliations:
Department of Electrical Engineering and Computer Science, University of Wisconsin;-;-
Venue:
IEEE Transactions on Computers
Year:
1984

Abstract

The problem of designing distributed fault-tolerant computing systems is considered. A model in which the network nodes are assumed to possess the ability to "test" certain other network facilities for the presence of failures is employed. Using this model, a distributed algorithm is presented which allows all the network nodes to correctly reach independent diagnoses of the condition (faulty or fault-free) of all the network nodes and internode communication facilities, provided the total number of failures does not exceed a given bound. The proposed algorithm allows for the reentry of repaired or replaced faulty facilities back into the network, and it also has provisions for adding new nodes to the system. Sufficient conditions are obtained for designing a distributed fault-tolerant system by employing the given algorithm. The algorithm has the interesting property that it lets as many as all of the nodes and internode communication facilities fail, but upon repair or replacement of faulty facilities, the system can converge to normal operation if no more than a certain number of facilities remain faulty.