Abstract: Perfect failure detectors can correctly decide whether a computer has crashed. However, a perfect failure detector cannot be implemented in a purely asynchronous system. We show how to enforce perfect failure detection in timed distributed systems equipped with hardware watchdogs. The system model makes two main assumptions: (1) each computer can measure time intervals with a known maximum error, and (2) each computer has a watchdog that crashes the computer unless it is periodically updated. We have implemented a system that satisfies both assumptions using a combination of off-the-shelf software and hardware.
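The two assumptions can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the protocol from the paper: the `Watchdog` class, the `safe_wait` helper, and the parameter `max_drift` are hypothetical names, and the model assumes every clock runs at a rate within [1 - max_drift, 1 + max_drift] of real time. Under that assumption, an observer that knows a host has not reset its watchdog since some point can wait `safe_wait(...)` on its own clock and then be sure the watchdog has fired.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Simulated hardware watchdog (hypothetical): 'crashes' the host unless reset in time."""
    timeout: float            # watchdog period, measured on the host's local clock
    last_reset: float = 0.0   # local time of the most recent successful reset
    crashed: bool = False

    def tick(self, now: float) -> None:
        """Advance to local time `now`; crash the host if the deadline has passed."""
        if now - self.last_reset > self.timeout:
            self.crashed = True

    def reset(self, now: float) -> bool:
        """Try to refresh the watchdog at local time `now`; a late reset has no effect."""
        self.tick(now)
        if not self.crashed:
            self.last_reset = now
        return not self.crashed


def safe_wait(timeout: float, max_drift: float) -> float:
    """Observer-clock time after which an unrefreshed watchdog must have fired.

    The host's clock may be slow, so the watchdog can take up to
    timeout / (1 - max_drift) of real time to fire. The observer's clock
    may be fast, so it must count (1 + max_drift) times that long.
    """
    if not 0 <= max_drift < 1:
        raise ValueError("max_drift must be in [0, 1)")
    return timeout * (1 + max_drift) / (1 - max_drift)
```

The drift bound is what turns a timeout into a guarantee: without a known maximum clock error, no finite wait on the observer's side could rule out a slow host clock that has not yet triggered the watchdog.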