An Algorithm for Distributed Hierarchical Diagnosis of Dynamic Fault and Repair Events

Authors:
Elias Procopio Duarte, Jr.; Alessandro Brawerman; Luiz Carlos P. Albini
Venue:
ICPADS '00 Proceedings of the Seventh International Conference on Parallel and Distributed Systems
Year:
2000

Abstract

The components of a fault-tolerant distributed system must be able to determine accurately which components of the system are faulty and which are fault-free. In this paper, we present a new distributed algorithm for event diagnosis in fully connected networks. An event is defined as a faulty node becoming fault-free, or vice versa. Previous hierarchical algorithms assume a static fault situation, in which an event can occur only after the previous event has been fully diagnosed. The new algorithm can diagnose dynamic events as long as each node stays in a given state long enough for all of its testers to detect that state. Each node running the algorithm keeps a timestamp for the state of every other node in the system. This timestamp is implemented as a counter that is incremented every time the node changes state. In this way, a tester may obtain information about a given node from more than one tested node without introducing inconsistencies, i.e., without mistaking an older state for a newer one. Nodes follow a hierarchical testing strategy, which forms a hypercube when all nodes are fault-free. When a fault-free node is tested, the tester obtains diagnostic information about N/2 nodes, for a system of N nodes. Despite the overhead of keeping and transferring timestamps, the new algorithm significantly reduces the average diagnosis latency compared with similar approaches, offering a new option for practical diagnosis implementations.
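
The timestamp mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `merge_timestamps` and `is_faulty`, the dictionary layout, and the convention that every node starts fault-free at counter 0 (so an odd counter would mean faulty) are assumptions made here for illustration only. The sketch shows how a tester might combine views received from several tested nodes while never letting an older state overwrite a newer one.

```python
from typing import Dict


def merge_timestamps(local: Dict[int, int], received: Dict[int, int]) -> Dict[int, int]:
    """Return a view that keeps, for every node, the larger state-change counter.

    A larger counter means more state changes have been observed, so the
    entry carrying it is the more recent one.
    """
    merged = dict(local)
    for node, counter in received.items():
        # Unknown nodes default to -1 so that any received counter is adopted.
        if counter > merged.get(node, -1):
            merged[node] = counter
    return merged


def is_faulty(counter: int) -> bool:
    # Assumption (not stated in the abstract): nodes start fault-free at
    # counter 0 and each event flips the state, so odd counters mean faulty.
    return counter % 2 == 1


# Hypothetical views: node id -> state-change counter.
local = {1: 0, 2: 3, 3: 2}
from_node_a = {1: 1, 2: 2}  # carries a stale entry for node 2
from_node_b = {2: 3, 3: 4}

view = merge_timestamps(merge_timestamps(local, from_node_a), from_node_b)
print(view)  # {1: 1, 2: 3, 3: 4}
```

In this example the stale counter 2 reported for node 2 is ignored because the tester already holds counter 3, which is the property the abstract attributes to the timestamps.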