Fundamentals of fault-tolerant distributed computing in asynchronous environments

Authors: Felix C. Gärtner
Affiliations: Darmstadt Univ. of Technology, Darmstadt, Germany
Venue: ACM Computing Surveys (CSUR)
Year: 1999

Abstract

Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims to structure the area and thus guide readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this formal approach helps to reveal fundamental structures that contribute to understanding and unifying methods and terminology. In doing so, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.