The customizable fault/error model for dependable distributed systems

Authors:
C. J. Walter; N. Suri
Affiliations:
WW Technology Group, 4519 Mustering Drum, Ellicott City, MD; Department of Computer Engineering, Chalmers University, S 41296, Goteborg, Sweden
Venue:
Theoretical Computer Science - Dependable computing
Year:
2003



Abstract

Dependability is a qualitative term referring to a system's ability to meet its service requirements in the presence of faults. The types and number of faults a system covers largely determine the level of dependability it can provide. Because fault types are varied and numerous, algorithm design is often simplified by focusing on a specific fault type, which yields dependable system designs that are either over-optimistic (all faults permanent) or over-pessimistic (all faults malicious). A more practical and realistic approach is to recognize that faults of varied severity and differing occurrence probabilities may appear in combination, rather than as occurrences of a single assumed fault type. The proposed customizable fault/error model (CFEM) is characterized by letting the user select and customize a particular combination of fault types of varied severity. The CFEM organizes diverse fault categories into a cohesive framework by classifying faults according to their effect on the required system services rather than by the source of the fault condition. In this paper, we develop (a) the complete framework for the CFEM fault classification, (b) the voting functions applicable under the CFEM, and (c) the fundamental distributed services of consensus and convergence under the CFEM, on which dependable distributed functionality can be supported.
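
The abstract does not define the CFEM voting functions, so the following is only an illustrative sketch of the general idea of a vote that tolerates a mix of fault effects. It assumes a generic trimmed-mean vote of the kind commonly used for convergence (approximate agreement), not the paper's own definitions. The function name `trimmed_mean_vote`, the use of `None` to mark a value whose fault was detected by the receiver, and the `trim` parameter are all hypothetical.

```python
from statistics import mean


def trimmed_mean_vote(values, trim):
    """Generic fault-tolerant mean (illustrative, not the CFEM definition).

    values: readings received from peers; None marks a reading whose fault
            was detected locally (assumed here to model an easily detected fault).
    trim:   number of undetected faulty readings to tolerate; that many
            smallest and largest readings are discarded before averaging.
    """
    # Detected faults are simply excluded from the vote.
    usable = sorted(v for v in values if v is not None)
    # At least one reading must survive trimming from both ends.
    if len(usable) <= 2 * trim:
        raise ValueError("not enough readings to outvote the assumed faults")
    kept = usable[trim:len(usable) - trim]
    return mean(kept)


if __name__ == "__main__":
    # One detected fault (None) and one wildly wrong reading (1000.0).
    readings = [10.0, 10.2, None, 9.8, 1000.0]
    print(trimmed_mean_vote(readings, trim=1))
```

In this sketch a detected fault costs only the one excluded reading, while each undetected fault costs two (one trimmed from each end). This is one way to see why treating every fault as the most severe type leads to over-pessimistic designs.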