A Survey of Techniques for Synchronization and Recovery in Decentralized Computer Systems
ACM Computing Surveys (CSUR)
Distributed fault-tolerance for large multiprocessor systems
ISCA '80 Proceedings of the 7th annual symposium on Computer Architecture
X-Tree: A tree structured multi-processor computer architecture
ISCA '78 Proceedings of the 5th annual symposium on Computer architecture
Design and simulation of the distributed loop computer network (DLCN)
ISCA '76 Proceedings of the 3rd annual symposium on Computer architecture
A large scale, homogeneous, fully distributed parallel machine, I
ISCA '77 Proceedings of the 4th annual symposium on Computer architecture
Proceedings of the 1975 ACM SIGCOMM/SIGOPS workshop on Interprocess communications
IEEE Transactions on Computers
Distributed off-line testing of parallel systems
ATS '95 Proceedings of the 4th Asian Test Symposium
SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Network management and system-level diagnosis
ICCCN '95 Proceedings of the 4th International Conference on Computer Communications and Networks
A Local Diagnosability Measure for Multiprocessor Systems
IEEE Transactions on Parallel and Distributed Systems
A distributed fault identification protocol for wireless and mobile ad hoc networks
Journal of Parallel and Distributed Computing
Distributed testing and diagnosis in a mobile computing environment
Proceedings of the 6th International Wireless Communications and Mobile Computing Conference
Crash faults identification in wireless sensor networks
Computer Communications
System-level fault diagnosis in fixed topology mobile ad hoc networks
International Journal of Communication Networks and Distributed Systems
The problem of designing distributed fault-tolerant computing systems is considered. The model employed assumes that network nodes can "test" certain other network facilities for the presence of failures. Using this model, a distributed algorithm is presented that allows all the network nodes to correctly reach independent diagnoses of the condition (faulty or fault-free) of all the network nodes and internode communication facilities, provided the total number of failures does not exceed a given bound. The proposed algorithm allows repaired or replaced facilities to reenter the network, and it also has provisions for adding new nodes to the system. Sufficient conditions are obtained for designing a distributed fault-tolerant system that employs the given algorithm. The algorithm has the interesting property that it tolerates the failure of as many as all of the nodes and internode communication facilities: once faulty facilities are repaired or replaced, the system can converge to normal operation, provided no more than a certain number of facilities remain faulty.
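The test-based model in the abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm. It is a simplified stand-in that makes several assumptions the abstract does not state: tests performed by a fault-free node are always accurate, a node accepts diagnostic information only from neighbors it has itself found fault-free (directly or through a chain of fault-free nodes), link failures are ignored, and repair and reentry are not modeled. The names `diagnose`, `ring_with_chords`, `neighbors`, and `is_faulty` are hypothetical and chosen for illustration only.

```python
from collections import deque


def diagnose(start, neighbors, is_faulty):
    """Return the diagnosis reached by the fault-free node `start`.

    neighbors: dict mapping each node to the nodes it is able to test.
    is_faulty: dict mapping each node to True/False. This is the ground
               truth and is used only to simulate the outcome of a test.
    """
    diagnosis = {start: "fault-free"}
    queue = deque([start])
    while queue:
        tester = queue.popleft()  # every node in the queue is fault-free
        for tested in neighbors[tester]:
            if tested in diagnosis:
                continue
            # Assumption: a fault-free tester reports the true condition.
            if is_faulty[tested]:
                diagnosis[tested] = "faulty"
            else:
                diagnosis[tested] = "fault-free"
                # Results from this node can now be trusted and relayed.
                queue.append(tested)
    return diagnosis


def ring_with_chords(n):
    # Hypothetical topology: node i can test i-2, i-1, i+1 and i+2 (mod n).
    return {i: sorted({(i + d) % n for d in (-2, -1, 1, 2)}) for i in range(n)}


neighbors = ring_with_chords(6)
is_faulty = {i: i in (1, 4) for i in range(6)}

# Each fault-free node builds its own independent view of the whole system.
views = {n: diagnose(n, neighbors, is_faulty)
         for n in range(6) if not is_faulty[n]}
```

In this example two of the six nodes are faulty and the fault-free nodes remain connected to one another, so every fault-free node can reach a complete diagnosis. If the faulty nodes disconnected the fault-free ones, some views would be incomplete, which is one informal way to see why a bound on the number of failures is needed.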