An Algorithm for Distributed Hierarchical Diagnosis of Dynamic Fault and Repair Events

  • Authors:
  • Elias Procopio Duarte, Jr.; Alessandro Brawerman; Luiz Carlos P. Albini

  • Venue:
  • ICPADS '00 Proceedings of the Seventh International Conference on Parallel and Distributed Systems
  • Year:
  • 2000

Abstract

The components of a fault-tolerant distributed system must be able to determine accurately which components of the system are faulty and which are fault-free. In this paper, we present a new distributed algorithm for event diagnosis in fully connected networks. An event is defined as a faulty node becoming fault-free, or the opposite. Previous hierarchical algorithms assume a static fault situation, in which an event can occur only after the previous event has been fully diagnosed. The new algorithm can diagnose dynamic events as long as each node stays in a given state long enough for all of its testers to detect that state. Each node running the algorithm keeps a timestamp for the state of every other node in the system. This timestamp is implemented as a counter that is incremented every time a node changes its state. In this way, a tester may obtain information about a given node from more than one tested node without causing inconsistencies, i.e., without mistaking an older state for a newer one. Nodes run a hierarchical testing strategy, which forms a hypercube when all nodes are fault-free. When a fault-free node is tested, the tester obtains diagnostic information about N/2 nodes, for a system of N nodes. Despite the overhead of keeping and transferring timestamps, the new algorithm significantly reduces the average latency compared to other similar approaches, presenting a new option for practical diagnosis implementation.
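
The timestamp mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (new_view, record_event, is_faulty, merge) are hypothetical, and the sketch assumes that every counter starts at 0 with the node fault-free, so that an even counter means fault-free and an odd counter means faulty. The abstract only states that the counter is incremented on each state change; the parity convention is an assumption made here for illustration.

```python
def new_view(n):
    """Local view of one node: a state-change counter for each of the n nodes.

    Assumption: every node starts fault-free with counter 0.
    """
    return [0] * n


def record_event(view, node):
    """Record one event (fault or repair) for `node` by incrementing its counter."""
    view[node] += 1


def is_faulty(view, node):
    """Under the parity assumption, an odd counter means the node is faulty."""
    return view[node] % 2 == 1


def merge(view, reported):
    """Merge diagnostic information obtained from a tested fault-free node.

    `reported` maps node ids to the counters held by the tested node.
    An entry is accepted only if its counter is larger than the local one,
    so an older state is never taken for a newer one.
    Returns the list of node ids whose entries were updated.
    """
    updated = []
    for node, timestamp in reported.items():
        if timestamp > view[node]:
            view[node] = timestamp
            updated.append(node)
    return updated


# Usage: the tester has already detected that node 3 failed (counter 1).
view = new_view(8)
record_event(view, 3)

# A tested node reports stale information about node 3 (counter 0)
# and newer information about nodes 5 and 6.
changed = merge(view, {3: 0, 5: 2, 6: 1})
print(changed)             # [5, 6]
print(is_faulty(view, 3))  # True: the stale report did not overwrite the fault
```

Because the comparison uses only the counters, the same node's state can arrive through several tested nodes in any order and the view still converges to the most recent state known to any of them.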