A failure detector (FD) is a fundamental component of fault-tolerant computer systems. In recent years, much research has addressed the QoS and implementation of FDs in distributed computing environments, and almost all of it is based on the heartbeat approach (HBFD). In this paper, we propose a general model for implementing FDs that separates the monitored processes from the underlying running environment. We identify the potential problems of the HBFD approach and propose an alternative, the notification-based FD (NTFD). Instead of having each process periodically send heartbeat messages to show that it is still alive, NTFD relies on the underlying watchdog mechanism to send a failure notification message only when the failure of a monitored process is detected locally. Compared with an HBFD implementation under our model, NTFD is more efficient and scalable and can guarantee the strong accuracy property. We analyze the trade-offs in achieving FD QoS; the results show that NTFD has a much higher probability of achieving a better balance between completeness and accuracy, with a much lower probability of false reports and a lower system cost. Based on this analysis, we propose the design of a hybrid FD that combines the advantages of HBFD and NTFD.
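The contrast between the two approaches can be illustrated with a toy discrete-time simulation. This is only a sketch under simplifying assumptions that are not taken from the paper: a lossless network with zero delay, a single monitored process, and a watchdog that always observes the crash after a fixed local delay. The names `run_heartbeat`, `run_notification`, `period`, `timeout`, and `watchdog_delay` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    messages_sent: int           # messages that crossed the network
    detected_at: Optional[int]   # tick at which the monitor suspected the process


def run_heartbeat(crash_at: int, horizon: int, period: int, timeout: int) -> Result:
    """HBFD sketch: the process sends a heartbeat every `period` ticks until it crashes.

    The monitor suspects the process once more than `timeout` ticks have
    passed since the last heartbeat it received (assumed lossless network).
    """
    messages = 0
    last_heartbeat = 0
    for tick in range(1, horizon + 1):
        if tick < crash_at and tick % period == 0:
            messages += 1
            last_heartbeat = tick
        if tick - last_heartbeat > timeout:
            return Result(messages, tick)
    return Result(messages, None)


def run_notification(crash_at: int, horizon: int, watchdog_delay: int) -> Result:
    """NTFD sketch: a local watchdog sends one notification after it sees the crash.

    `watchdog_delay` is an assumed fixed time the watchdog needs to notice
    the failure locally; no message is sent while the process is alive.
    """
    for tick in range(1, horizon + 1):
        if tick >= crash_at + watchdog_delay:
            return Result(1, tick)
    return Result(0, None)
```

For example, with a crash at tick 50, `run_heartbeat(50, 100, period=5, timeout=10)` sends nine heartbeats before the monitor suspects the process, whereas `run_notification(50, 100, watchdog_delay=2)` sends a single message. In this idealized setting the message count of the heartbeat scheme grows with the observation time even when nothing fails, while the notification scheme stays silent until a failure occurs. The sketch says nothing about message loss or watchdog failure, which are the cases where a real comparison becomes interesting.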