Perfect Failure Detection in Timed Asynchronous Systems

Authors: Christof Fetzer
Venue: IEEE Transactions on Computers
Year: 2003


Abstract

Perfect failure detectors can correctly decide whether a computer has crashed. However, it is impossible to implement a perfect failure detector in purely asynchronous systems. We show how to enforce perfect failure detection in timed asynchronous systems with hardware watchdogs. The two main system model assumptions are 1) each computer can measure time intervals with a known maximum error and 2) each computer has a watchdog that crashes the computer unless the watchdog is periodically updated. We have implemented a system that satisfies both assumptions using a combination of off-the-shelf software and hardware. We also show that, in some systems, a hardware watchdog is not actually necessary to implement a perfect failure detector for process crash failures.
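
The two assumptions in the abstract can be combined into a simple timing argument. The sketch below is illustrative only and is not taken from the paper: it assumes a hypothetical lease-style scheme in which the monitored computer updates its watchdog only after receiving a grant from the observer within a bounded round trip, and in which every clock runs at a rate within a factor of (1 - max_drift) to (1 + max_drift) of real time. All names (Params, crash_deadline, observer_wait, may_update_watchdog, can_report_crash) are assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    """Hypothetical timing parameters; none of these names come from the paper."""
    watchdog_timeout: float  # seconds on the watchdog timer before it crashes the computer
    max_round_trip: float    # longest request-to-grant delay the monitored computer accepts
    max_drift: float         # assumed bound on relative clock rate error, e.g. 1e-4


def crash_deadline(p: Params) -> float:
    """Real-time bound, counted from the moment the observer sent its last grant,
    after which the monitored computer must have crashed."""
    # Worst case: the monitored computer's timers run slow (rate 1 - max_drift),
    # so both the accepted round trip and the watchdog timeout last longer in real time.
    return (p.max_round_trip + p.watchdog_timeout) / (1.0 - p.max_drift)


def observer_wait(p: Params) -> float:
    """Interval the observer must measure on its own clock to cover crash_deadline."""
    # Worst case: the observer's clock runs fast (rate 1 + max_drift),
    # so it has to count more local time to be sure enough real time has passed.
    return crash_deadline(p) * (1.0 + p.max_drift)


def may_update_watchdog(p: Params, request_sent_at: float, grant_received_at: float) -> bool:
    """Monitored side: accept a grant only if the locally measured round trip is short enough."""
    return grant_received_at - request_sent_at <= p.max_round_trip


def can_report_crash(p: Params, last_grant_sent_at: float, now: float) -> bool:
    """Observer side: after it stops granting, report a crash once the wait has elapsed."""
    return now - last_grant_sent_at >= observer_wait(p)


if __name__ == "__main__":
    params = Params(watchdog_timeout=1.0, max_round_trip=0.5, max_drift=1e-4)
    print("observer must wait (local seconds):", observer_wait(params))
    print("crash may be reported at t=2.0:", can_report_crash(params, 0.0, 2.0))
```

Under these assumptions, the observer never reports a crash while the monitored computer is still running: once the observer stops sending grants, the watchdog is no longer updated and forces the crash before the observer's wait expires. The price is that a slow but otherwise healthy computer is crashed by its own watchdog.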