Dependability is a qualitative term referring to a system's ability to meet its service requirements in the presence of faults. The types and number of faults covered by a system play a primary role in determining the level of dependability that the system can potentially provide. Given the variety and multiplicity of fault types, algorithm design often focuses on a specific fault type to simplify the design process, resulting in either over-optimistic (all faults permanent) or over-pessimistic (all faults malicious) dependable system designs.

A more practical and realistic approach is to recognize that faults of varied severity and differing occurrence probabilities may appear in combination, rather than as occurrences of a single assumed fault type. The proposed customizable fault/error model (CFEM) is characterized by the ability to let the user select and customize a particular combination of fault types of varied severity. The CFEM organizes diverse fault categories into a cohesive framework by classifying faults according to the effect they have on the required system services rather than by the source of the fault condition. In this paper, we develop (a) the complete framework for the CFEM fault classification, (b) the voting functions applicable under the CFEM, and (c) the fundamental distributed services of consensus and convergence under the CFEM, on which dependable distributed functionality can be built.
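The abstract does not spell out the CFEM voting functions, so the following is only a generic illustration of what a fault-tolerant voting function can look like, not the paper's definition. It sketches the well-known fault-tolerant midpoint: discard the t lowest and t highest received values, then take the midpoint of what remains. The function name `fault_tolerant_midpoint` and the single fault bound `t` are assumptions made for this example; a hybrid model such as the CFEM would distinguish several fault classes with separate bounds.

```python
from typing import Sequence


def fault_tolerant_midpoint(values: Sequence[float], t: int) -> float:
    """Illustrative voting function (hypothetical, not taken from the paper).

    Discards the t smallest and t largest values, then returns the
    midpoint of the smallest and largest surviving values. Up to t
    arbitrarily wrong inputs therefore cannot pull the result outside
    the range spanned by the correct inputs.
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    if len(values) <= 2 * t:
        raise ValueError("need more than 2*t values to tolerate t faults")

    ordered = sorted(values)
    # Drop the t lowest and t highest readings.
    kept = ordered[t:len(ordered) - t]
    return (kept[0] + kept[-1]) / 2.0


# Example: four nodes report a reading and one of them is wildly off.
readings = [10.0, 10.2, 9.9, 1000.0]
voted = fault_tolerant_midpoint(readings, t=1)
```

With `t=1`, the outlier `1000.0` and the lowest reading `9.9` are both discarded, so the voted value lies between the two remaining readings.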