This paper describes a new comparison-based model for distributed fault diagnosis in multicomputer systems with a weak reliable broadcast capability. Both classical problems, diagnosability and diagnosis, are considered under this broadcast comparison model. A characterization of diagnosable systems is given, and it leads to a polynomial-time diagnosability algorithm. A polynomial-time diagnosis algorithm for $t$-diagnosable systems is also given. A variation of this algorithm, which allows dynamic fault occurrence and incomplete diagnostic information, has been implemented in the COmmon Spaceborne Multicomputer Operating System (COSMOS). Results produced using a simulator for the JPL MAX multicomputer system running COSMOS show that the algorithm diagnoses all fault situations with low latency and very little overhead. These simulations demonstrate the practicality of the proposed diagnosis model and algorithm for multicomputer systems having weak reliable broadcast. Such systems include those with fault-tolerant broadcast hardware as well as those in which reliable broadcast is implemented in software.
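To make the two problems concrete (diagnosability: can any fault set of size at most $t$ be identified uniquely? diagnosis: which fault set produced this syndrome?), the following is a minimal brute-force sketch. It rests on an assumed comparison rule that is labeled in the code and is not taken from the paper. Under that rule, comparing the broadcast outputs of two fault-free processors yields 0, a fault-free/faulty pair yields 1, and two faulty processors may yield either value. The sketch enumerates every candidate fault set, so it is exponential in $t$. It is not the polynomial-time characterization or diagnosis algorithm the paper describes. All function names (`forced_outcome`, `diagnose`, `is_t_diagnosable`, `simulate_syndrome`) are hypothetical.

```python
"""Brute-force sketch of comparison-based fault diagnosis.

ASSUMPTION (not taken from the paper): comparing the broadcast outputs of
processors u and v yields
  0        if both are fault-free (outputs agree),
  1        if exactly one is faulty (outputs disagree),
  0 or 1   if both are faulty (unconstrained).
This exhaustive search is exponential in t; it is NOT the paper's
polynomial-time diagnosability or diagnosis algorithm.
"""
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

Edge = Tuple[int, int]
Syndrome = Dict[Edge, int]


def forced_outcome(edge: Edge, faulty: FrozenSet[int]) -> Optional[int]:
    """Comparison result implied by a fault set, or None if unconstrained."""
    u, v = edge
    n_faulty = (u in faulty) + (v in faulty)
    if n_faulty == 0:
        return 0
    if n_faulty == 1:
        return 1
    return None  # two faulty processors may or may not agree


def fault_sets(nodes: Iterable[int], t: int) -> List[FrozenSet[int]]:
    """Every candidate fault set with at most t processors."""
    ordered = sorted(nodes)
    return [frozenset(c) for k in range(t + 1) for c in combinations(ordered, k)]


def is_consistent(edges: List[Edge], syndrome: Syndrome, faulty: FrozenSet[int]) -> bool:
    """True if this fault set could have produced the observed syndrome."""
    for e in edges:
        forced = forced_outcome(e, faulty)
        if forced is not None and forced != syndrome[e]:
            return False
    return True


def diagnose(nodes, edges, syndrome, t) -> Optional[FrozenSet[int]]:
    """Unique fault set of size <= t matching the syndrome, else None."""
    matches = [f for f in fault_sets(nodes, t) if is_consistent(edges, syndrome, f)]
    return matches[0] if len(matches) == 1 else None


def separates(edge: Edge, f1: FrozenSet[int], f2: FrozenSet[int]) -> bool:
    """True if f1 and f2 force different results on this comparison."""
    a, b = forced_outcome(edge, f1), forced_outcome(edge, f2)
    return a is not None and b is not None and a != b


def is_t_diagnosable(nodes, edges, t) -> bool:
    """True if no single syndrome is consistent with two distinct fault sets of size <= t."""
    for f1, f2 in combinations(fault_sets(nodes, t), 2):
        if not any(separates(e, f1, f2) for e in edges):
            return False
    return True


def simulate_syndrome(edges, faulty, both_faulty_result=0) -> Syndrome:
    """Syndrome produced by a fault set; both-faulty pairs report a fixed value."""
    result = {}
    for e in edges:
        forced = forced_outcome(e, faulty)
        result[e] = both_faulty_result if forced is None else forced
    return result


if __name__ == "__main__":
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(is_t_diagnosable(range(4), k4, 1))  # True
    print(is_t_diagnosable(range(4), k4, 2))  # False: {0,1} and {2,3} are indistinguishable
    s = simulate_syndrome(k4, frozenset({2}))
    print(sorted(diagnose(range(4), k4, s, 1)))  # [2]
```

Enumerating fault sets keeps the definitions explicit. Two fault sets are indistinguishable exactly when no comparison is forced to different values by them. However, the number of candidate sets grows roughly as $n^t$, and the diagnosability check compares all pairs of them. The sketch is therefore only suitable for tiny illustrative systems, not as a substitute for a polynomial-time method.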