A reconfigurable fault-tolerant system achieves dependable operation through fault detection, fault isolation, and reconfiguration, typically referred to as the FDIR paradigm. Fault diagnosis is a key component of this approach, requiring an accurate determination of the health and state of the system. An imprecise state assessment can lead to catastrophic failure when the diagnosis is optimistic or, conversely, to underutilization of resources when the diagnosis is pessimistic. In contrast to classical testing and other off-line diagnostic approaches, we develop procedures that make maximal use of system state information to provide continual, on-line diagnosis and reconfiguration as an integral part of system operation. Unlike existing techniques, our diagnosis approach does not require administered testing to gather syndrome information; instead, it is based on monitoring the message traffic among redundant system functions. We present comprehensive on-line diagnosis algorithms capable of handling a continuum of faults of varying severity at the node and link level. The proposed algorithms are not only on-line in nature but are also themselves tolerant to faults in the diagnostic process. Formal analysis, including proofs, is presented for all proposed algorithms. These proofs both offer insight into the operation of the algorithms and facilitate their rigorous formal verification.
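The idea of gathering syndrome information from ordinary message traffic, rather than from administered tests, can be illustrated with a minimal sketch. It is not the algorithm developed in the paper: the function name diagnose_round, the dictionary of observed values, and the strict-majority rule are all assumptions made for illustration, and the sketch ignores link faults and faults in the monitor itself.

```python
from collections import Counter

def diagnose_round(observed):
    """Hypothetical one-round diagnosis from monitored traffic.

    observed maps a node id to the value a monitor saw that node send
    in this round, or None if no message arrived. Assumes redundant
    nodes are expected to send the same value.
    Returns (majority_value, suspects); (None, set()) if there is no
    strict majority, in which case no diagnosis is attempted.
    """
    received = [v for v in observed.values() if v is not None]
    if not received:
        return None, set()
    value, count = Counter(received).most_common(1)[0]
    # Require agreement from more than half of all monitored nodes.
    if count <= len(observed) // 2:
        return None, set()
    # Nodes that deviated from the majority, or stayed silent, are suspects.
    suspects = {node for node, v in observed.items() if v != value}
    return value, suspects

majority, suspects = diagnose_round({"A": 7, "B": 7, "C": 7, "D": 9, "E": None})
```

In this sketch a node that sends a deviating value and a node that sends nothing are both flagged, which is a deliberately coarse treatment; distinguishing fault severities would need more state than a single round provides.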