Failure detection and consensus in the crash-recovery model

Authors:
Marcos Kawazoe Aguilera;Wei Chen;Sam Toueg
Affiliations:
Department of Computer Science, Cornell University, Ithaca, NY;Oracle Corporation, One Oracle Drive, Nashua, NH;Department of Computer Science, Cornell University, Ithaca, NY
Venue:
Distributed Computing
Year:
2000

Abstract

We study the problems of failure detection and consensus in asynchronous systems in which processes may crash and recover, and links may lose messages. We first propose new failure detectors that are particularly suitable to the crash-recovery model. We next determine under what conditions stable storage is necessary to solve consensus in this model. Using the new failure detectors, we give two consensus algorithms that match these conditions: one requires stable storage and the other does not. Both algorithms tolerate link failures and are particularly efficient in the runs that are most likely in practice, namely those with no failures or failure detector mistakes. In such runs, consensus is achieved within 3δ time and with 4n messages, where δ is the maximum message delay and n is the number of processes in the system.
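
To make the closing figures concrete, the following is a minimal sketch that simply evaluates the bounds quoted in the abstract (3δ time, 4n messages) for a given system size and maximum message delay. It is not the authors' algorithm; the names `FailureFreeCost` and `failure_free_cost` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureFreeCost:
    """Cost of one consensus instance in a run with no failures
    and no failure detector mistakes (hypothetical helper type)."""
    latency: float  # time bound to reach consensus, same unit as delta
    messages: int   # number of messages exchanged


def failure_free_cost(n: int, delta: float) -> FailureFreeCost:
    """Evaluate the bounds stated in the abstract: 3*delta time, 4*n messages.

    n     -- number of processes in the system
    delta -- maximum message delay
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if delta < 0:
        raise ValueError("delta must be non-negative")
    return FailureFreeCost(latency=3 * delta, messages=4 * n)


# Example: 5 processes, maximum message delay of 10 ms (expressed in seconds).
cost = failure_free_cost(n=5, delta=0.010)
print(cost.messages)  # 20
```

Under these assumptions, a five-process system with a 10 ms maximum message delay would reach consensus within roughly 30 ms using 20 messages in such a run; runs with failures or failure detector mistakes are not covered by these figures.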