A Hierarchical Adaptive Distributed System-Level Diagnosis Algorithm

Authors:
Elias Procópio Duarte, Jr.; Takashi Nanya
Affiliations:
Federal Univ. of Paraná, Curitiba PR, Brazil; Univ. of Tokyo, Tokyo, Japan
Venue:
IEEE Transactions on Computers
Year:
1998

Abstract

Consider a system composed of N nodes that can be faulty or fault-free. The purpose of distributed system-level diagnosis is to have each fault-free node determine the state of all nodes of the system. This paper presents a Hierarchical Adaptive Distributed System-level Diagnosis (Hi-ADSD) algorithm, a fully distributed algorithm that allows every fault-free node to achieve diagnosis in at most (log_2 N)^2 testing rounds. Nodes are mapped into progressively larger logical clusters, so that tests are run in a hierarchical fashion. Each node executes its tests independently of the other nodes, i.e., tests are run asynchronously. All the information that nodes exchange is diagnostic information. The algorithm assumes no link faults and a fully connected network, and imposes no bound on the number of faults. Both the worst-case diagnosis latency and the correctness of the algorithm are formally proved. As an example application, the algorithm was implemented on a 37-node Ethernet LAN, integrated into a network management system based on SNMP (Simple Network Management Protocol). Experimental results of fault and repair diagnosis are presented. This implementation is by itself a significant contribution because, although fault management is a key functional area of network management systems, currently deployed applications often implement only rudimentary diagnosis mechanisms. Furthermore, experimental results are given through simulation of the algorithm for large systems of 64 nodes and 512 nodes.
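
The two quantitative ideas in the abstract, the (log_2 N)^2 bound and the progressively larger logical clusters, can be illustrated with a minimal sketch. It assumes N is a power of two. The function names `worst_case_rounds` and `cluster` are hypothetical, and the bitwise cluster mapping is only one plausible way to build clusters of sizes 1, 2, 4, ...; the abstract does not specify the paper's exact mapping or testing order.

```python
from __future__ import annotations


def worst_case_rounds(n: int) -> int:
    """Upper bound (log_2 N)^2 on testing rounds, as stated in the abstract.

    Assumption of this sketch: N is a power of two and N >= 2.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("this sketch assumes N is a power of two, N >= 2")
    log_n = n.bit_length() - 1  # exact log_2 for a power of two
    return log_n ** 2


def cluster(i: int, s: int) -> list[int]:
    """Hypothetical mapping: the s-th logical cluster seen from node i.

    The cluster holds the 2**(s-1) node ids that differ from i in bit s-1
    and agree with i on all higher bits (an illustrative assumption, not
    necessarily the mapping used in the paper).
    """
    half = 1 << (s - 1)
    base = (i ^ half) & ~(half - 1)  # flip bit s-1, clear the lower bits
    return [base + k for k in range(half)]


if __name__ == "__main__":
    for n in (64, 512):
        print(n, worst_case_rounds(n))
    # Clusters of sizes 1, 2 and 4 seen from node 0 in an 8-node system
    for s in (1, 2, 3):
        print(s, cluster(0, s))
```

For the simulated system sizes mentioned in the abstract, the bound evaluates to 36 rounds for 64 nodes and 81 rounds for 512 nodes. Under the assumed mapping, the clusters seen from any node, together with the node itself, cover all N node ids exactly once.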