Fault tolerance in distributed computing is a broad area whose large body of literature varies widely in methodology and terminology. This paper aims to structure the area and thereby guide readers into the field. We use a formal approach to define important terms such as fault, fault tolerance, and redundancy. These definitions lead to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this approach helps reveal fundamental structures that contribute to understanding and unifying methods and terminology. In doing so, we survey many existing methodologies and discuss how they relate. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.