A partitioning method for efficient system-level diagnosis
Journal of Systems and Software
Distributed Diagnosis in Dynamic Fault Environments
IEEE Transactions on Parallel and Distributed Systems
Journal of Electronic Testing: Theory and Applications
Heartbeat based fault diagnosis for mobile ad-hoc network
ACST'07 Proceedings of the third conference on IASTED International Conference: Advances in Computer Science and Technology
A distributed fault identification protocol for wireless and mobile ad hoc networks
Journal of Parallel and Distributed Computing
A survey of comparison-based system-level diagnosis
ACM Computing Surveys (CSUR)
A scalable multi-level distributed system-level diagnosis
ICDCIT'05 Proceedings of the Second international conference on Distributed Computing and Internet Technology
The components of a fault-tolerant distributed system must be able to determine accurately which components of the system are faulty and which are fault-free. In this paper, we present a new distributed algorithm for event diagnosis in fully connected networks. An event is defined as a faulty node becoming fault-free, or vice versa. Previous hierarchical algorithms assume a static fault situation, in which a new event can occur only after the previous event has been fully diagnosed. The new algorithm can diagnose dynamic events as long as each node stays in a given state long enough for all of its testers to detect that state. Each node running the algorithm keeps a timestamp for the state of every other node in the system. This timestamp is implemented as a counter that is incremented every time a node changes its state. In this way, a tester may obtain information about a given node from more than one tested node without causing inconsistencies, i.e., without mistaking an older state for a newer one. Nodes follow a hierarchical testing strategy whose testing graph is a hypercube when all nodes are fault-free. When a fault-free node is tested, the tester obtains diagnostic information about N/2 nodes, for a system of N nodes. Despite the overhead of keeping and transferring timestamps, the new algorithm significantly reduces the average latency compared to other similar approaches, offering a new option for practical diagnosis implementations.
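The timestamp rule described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the names `Entry`, `DiagnosisTable`, `record_test`, and `merge` are hypothetical, and the sketch assumes that each tester bumps its local counter for a node whenever a direct test shows a state change, and that information received from a tested fault-free node is adopted only when its counter is strictly larger than the local one.

```python
from dataclasses import dataclass

FAULT_FREE = "fault-free"
FAULTY = "faulty"


@dataclass
class Entry:
    """Locally known state of one node plus its state-change counter."""
    state: str = FAULT_FREE
    timestamp: int = 0


class DiagnosisTable:
    """Hypothetical per-node table: one Entry for every node in the system."""

    def __init__(self, n):
        self.entries = [Entry() for _ in range(n)]

    def record_test(self, node, observed_state):
        """Apply a direct test result; the counter grows only on a state change."""
        entry = self.entries[node]
        if entry.state != observed_state:
            entry.state = observed_state
            entry.timestamp += 1

    def merge(self, reported, nodes):
        """Adopt entries reported by a tested fault-free node, but only newer ones.

        Returns the list of node ids whose local entry was updated.
        """
        updated = []
        for j in nodes:
            theirs = reported.entries[j]
            if theirs.timestamp > self.entries[j].timestamp:
                self.entries[j] = Entry(theirs.state, theirs.timestamp)
                updated.append(j)
        return updated


# Node 3 fails and is later repaired. Tester b saw only the failure,
# tester c saw both events, and a learns about node 3 from both of them.
a, b, c = DiagnosisTable(4), DiagnosisTable(4), DiagnosisTable(4)
b.record_test(3, FAULTY)
c.record_test(3, FAULTY)
c.record_test(3, FAULT_FREE)
a.merge(c, [3])   # newer information: adopted
a.merge(b, [3])   # stale information: ignored
```

Because the comparison is on the counter rather than on arrival order, the result for `a` is the same whichever of `b` or `c` it happens to test first, which is the property that lets a tester safely combine reports from several tested nodes.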