Enforcing Perfect Failure Detection

Authors:
Affiliations:
Venue: ICDCS '01 Proceedings of the 21st International Conference on Distributed Computing Systems
Year: 2001

Abstract

Perfect failure detectors can correctly decide whether a computer has crashed. However, it is impossible to implement a perfect failure detector in a purely asynchronous system. We show how to enforce perfect failure detection in timed distributed systems with hardware watchdogs. The two main system model assumptions are (1) each computer can measure time intervals with a known maximum error, and (2) each computer has a watchdog that crashes the computer unless the watchdog is periodically updated. We have implemented a system that satisfies both assumptions using a combination of off-the-shelf software and hardware.
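
The two assumptions can be combined into a simple timing argument, sketched below in Python. This is an illustration only, not the protocol from the paper. It assumes a hypothetical error model in which every clock measures a real interval d as some value between d * (1 - max_drift) and d * (1 + max_drift), and it assumes the monitored computer extends its watchdog deadline only for heartbeats the monitor has acknowledged, counting the deadline from the heartbeat's send time. The names safe_wait, Monitor, watchdog_timeout, and max_drift are invented for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


def safe_wait(watchdog_timeout: float, max_drift: float) -> float:
    """Local-clock wait after which an unrefreshed watchdog must have fired.

    Assumed (hypothetical) error model: a clock measures a real interval d
    as a value in [d * (1 - max_drift), d * (1 + max_drift)].
    """
    if not 0.0 <= max_drift < 1.0:
        raise ValueError("max_drift must be in [0, 1)")
    # A slow watchdog clock fires after at most this much real time.
    worst_case_real = watchdog_timeout / (1.0 - max_drift)
    # A fast monitor clock shows at most this much when that real time has passed.
    return worst_case_real * (1.0 + max_drift)


@dataclass
class Monitor:
    """Suspects a computer only once its watchdog is guaranteed to have fired."""

    watchdog_timeout: float
    max_drift: float
    last_heartbeat: Optional[float] = None  # monitor's local receive time

    def on_heartbeat(self, local_now: float) -> None:
        # The heartbeat was sent before it was received, so the deadline
        # counted from the send time cannot be later than one counted from here.
        self.last_heartbeat = local_now

    def may_declare_crashed(self, local_now: float) -> bool:
        if self.last_heartbeat is None:
            return False  # nothing acknowledged yet, so nothing can be concluded
        elapsed = local_now - self.last_heartbeat
        return elapsed >= safe_wait(self.watchdog_timeout, self.max_drift)
```

Under these assumptions, a True result from may_declare_crashed means the watchdog has already crashed the computer, so the monitor never suspects a computer that is still running. The cost is detection latency: the monitor must wait slightly longer than the watchdog timeout to absorb the clock error on both sides.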