Abstract: The problem of distributed diagnosis in the presence of dynamic failures and repairs is considered. To address this problem, the notion of bounded correctness is defined. Bounded correctness comprises three properties: bounded diagnostic latency, which ensures that information about state changes of nodes in the system reaches working nodes with a bounded delay; bounded start-up time, which guarantees that working nodes determine valid states for every other node in the system within a bounded time after their recovery; and accuracy, which ensures that no spurious events are recorded by working nodes. It is shown that, to achieve bounded correctness, the rate at which nodes fail and are repaired must be limited. This requirement is quantified by defining a minimum state holding time in the system. Algorithm HeartbeatComplete is presented, and it is proven that this algorithm achieves bounded correctness in fully-connected systems while simultaneously minimizing diagnostic latency, start-up time, and state holding time. A diagnosis algorithm for arbitrary topologies, Algorithm ForwardHeartbeat, is also presented. ForwardHeartbeat is shown to produce significantly shorter latency and state holding time than prior algorithms, which focused primarily on minimizing the number of tests at the expense of latency.
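To make the vocabulary of the abstract concrete, the following is a minimal sketch of a generic heartbeat-timeout monitor. It is not Algorithm HeartbeatComplete or ForwardHeartbeat, whose details the abstract does not give. The class name `HeartbeatMonitor`, the `timeout` parameter, and the event tuple layout are all assumptions made for illustration. The sketch shows two ideas only: a node is declared failed once no heartbeat has arrived within a fixed timeout (which bounds how long a failure can go unnoticed), and an event is logged only when a node's recorded state actually changes (a simple stand-in for the accuracy property).

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor run by one working node (illustrative only)."""

    timeout: float                                   # assumed detection timeout
    last_seen: dict = field(default_factory=dict)    # node -> time of last heartbeat
    state: dict = field(default_factory=dict)        # node -> "working" or "failed"
    events: list = field(default_factory=list)       # (time, node, new_state)

    def record_heartbeat(self, node, now):
        """A heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now
        self._set_state(node, "working", now)

    def check(self, now):
        """Mark as failed every node whose last heartbeat is older than the timeout."""
        for node, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self._set_state(node, "failed", now)

    def _set_state(self, node, new_state, now):
        # Log an event only on a real state change, so repeated heartbeats
        # or repeated checks never add duplicate events.
        if self.state.get(node) != new_state:
            self.state[node] = new_state
            self.events.append((now, node, new_state))


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_heartbeat("B", 0.0)   # B first observed as working
monitor.record_heartbeat("B", 1.0)   # no state change, no new event
monitor.check(3.5)                   # 2.5 time units of silence: still working
monitor.check(4.5)                   # 3.5 time units of silence: now failed
monitor.record_heartbeat("B", 6.0)   # B repaired and heard from again
print(monitor.events)
```

In this toy setting, a node that fails and recovers between two heartbeats would never be noticed, which gives some intuition for why the abstract requires a minimum state holding time: state changes must persist long enough for the detection mechanism to observe them.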