Distributed Diagnosis in Dynamic Fault Environments

Authors:
Arun Subbiah;Douglas M. Blough
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2004

Abstract

The problem of distributed diagnosis in the presence of dynamic failures and repairs is considered. To address this problem, the notion of bounded correctness is defined. Bounded correctness comprises three properties: bounded diagnostic latency, which ensures that information about state changes of nodes in the system reaches working nodes with a bounded delay; bounded start-up time, which guarantees that working nodes determine valid states for every other node in the system within a bounded time after their recovery; and accuracy, which ensures that no spurious events are recorded by working nodes. It is shown that, to achieve bounded correctness, the rate at which nodes fail and are repaired must be limited. This requirement is quantified by defining a minimum state holding time in the system. Algorithm HeartbeatComplete is presented and proven to achieve bounded correctness in fully connected systems while simultaneously minimizing diagnostic latency, start-up time, and state holding time. A diagnosis algorithm for arbitrary topologies, Algorithm ForwardHeartbeat, is also presented. ForwardHeartbeat is shown to produce significantly shorter latency and state holding time than prior algorithms, which focused primarily on minimizing the number of tests at the expense of latency.
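
To make the three properties concrete, the following is a minimal sketch of a generic heartbeat-based failure detector in a fully connected system. It is not the paper's HeartbeatComplete algorithm, whose details the abstract does not give. The names `Node`, `simulate`, and `timeout` are hypothetical, and the sketch assumes synchronous rounds with reliable message delivery.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    timeout: int  # hypothetical: rounds of silence before a peer is declared failed
    working: bool = True
    last_heard: dict = field(default_factory=dict)  # peer id -> round of last heartbeat
    view: dict = field(default_factory=dict)        # peer id -> "working" or "failed"

    def receive(self, sender: int, now: int) -> None:
        """Record a heartbeat from a peer and mark it as working."""
        self.last_heard[sender] = now
        self.view[sender] = "working"

    def check(self, peer_ids, now: int) -> None:
        """Mark peers as failed if they have been silent for `timeout` rounds."""
        for peer in peer_ids:
            if peer == self.node_id:
                continue
            last = self.last_heard.get(peer)
            if last is None or now - last >= self.timeout:
                self.view[peer] = "failed"


def simulate(n: int, rounds: int, events: dict, timeout: int = 2) -> list:
    """Run `rounds` synchronous rounds over n fully connected nodes.

    `events` maps a round number to a list of (node_id, working) state changes
    applied at the start of that round (assumed input format).
    """
    nodes = [Node(i, timeout) for i in range(n)]
    for now in range(rounds):
        for node_id, working in events.get(now, []):
            nodes[node_id].working = working
            if working:
                # A repaired node restarts with no knowledge of its peers.
                nodes[node_id].last_heard.clear()
                nodes[node_id].view.clear()
        # Every working node sends a heartbeat to every other working node.
        for sender in nodes:
            if sender.working:
                for receiver in nodes:
                    if receiver.working and receiver is not sender:
                        receiver.receive(sender.node_id, now)
        # Every working node then updates its view of the system.
        for node in nodes:
            if node.working:
                node.check(range(n), now)
    return nodes
```

In this sketch, a failure is recorded within `timeout` rounds (a bound on latency), a repaired node rebuilds its view in the round it restarts (a bound on start-up time), and no peer is marked failed while its heartbeats keep arriving (accuracy). A node that fails and is repaired within fewer than `timeout` rounds is never recorded as failed, which illustrates why a minimum state holding time is needed.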