The notification based approach to implementing failure detectors in distributed systems

Authors:
Jin Yang; Jiannong Cao; Weigang Wu; Corentin Travers
Affiliations:
Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong; Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong; Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong; IRISA, Université de Rennes, France
Venue:
InfoScale '06 Proceedings of the 1st international conference on Scalable information systems
Year:
2006

Abstract

A failure detector (FD) is a fundamental component of fault-tolerant computer systems. In recent years, much research has addressed the quality of service (QoS) and the implementation of FDs for distributed computing environments, and almost all of this work is based on the heartbeat approach (HBFD). In this paper, we propose a general model for implementing FDs that separates the processes to be monitored from the underlying running environment. We identify the potential problems of the HBFD approach and propose an alternative approach to implementing FDs, called the notification-based FD (NTFD). Instead of having the process periodically send heartbeat messages to show that it is still alive, NTFD relies on an underlying watchdog mechanism that sends a failure notification message only when it detects the failure of a monitored process locally. Compared with an HBFD implementation under our model, NTFD is more efficient and scalable, and it can guarantee the strong accuracy property. We analyze the trade-offs involved in achieving FD QoS; the results show that NTFD is much more likely to achieve a better balance between completeness and accuracy, while yielding a much lower probability of false reports and a lower system cost. Based on this analysis, we propose the design of a hybrid FD that combines the advantages of HBFD and NTFD.
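To make the contrast between the two detection styles concrete, here is a minimal discrete-time sketch. It is not the authors' model or algorithm. It is a toy simulation under explicitly assumed conditions. The local watchdog is assumed to observe crashes perfectly and never to report a live process. Network delays are given as a hypothetical list that is reused cyclically. All names (`heartbeat_fd`, `notification_fd`, `RunResult`, `watchdog_latency`, `notify_delay`) are illustrative. The sketch counts messages sent to the remote monitor, false suspicions, and detection time. Under the reliable-watchdog assumption, the NTFD side never raises a false suspicion, which mirrors the strong accuracy claim above.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    messages: int                    # messages sent to the remote monitor
    false_suspicions: int            # suspicions raised while the process was still alive
    detection_time: Optional[float]  # when the crash was first suspected (None if never)


def heartbeat_fd(horizon: float, period: float, timeout: float,
                 delays: list[float], crash_at: Optional[float] = None) -> RunResult:
    """Toy heartbeat FD (HBFD).

    The process sends a heartbeat every `period` seconds until it crashes.
    The monitor suspects it whenever `timeout` elapses with no new heartbeat.
    `delays` is an assumed, cyclically reused list of network delays.
    """
    sends = []
    k = 0
    while k * period < horizon and (crash_at is None or k * period < crash_at):
        sends.append(k * period)
        k += 1
    arrivals = sorted(s + delays[i % len(delays)] for i, s in enumerate(sends))
    arrivals = [a for a in arrivals if a <= horizon]

    false_suspicions = 0
    last = 0.0  # the monitor starts its timer at time 0
    for a in arrivals + [math.inf]:  # math.inf stands for "no further heartbeat"
        deadline = last + timeout
        if a > deadline and deadline <= horizon:
            # The timeout expired before the next heartbeat: the monitor suspects.
            if crash_at is not None and deadline >= crash_at:
                return RunResult(len(sends), false_suspicions, deadline)
            false_suspicions += 1  # the process was alive, so this is a false suspicion
        last = a
    return RunResult(len(sends), false_suspicions, None)


def notification_fd(horizon: float, watchdog_latency: float, notify_delay: float,
                    crash_at: Optional[float] = None) -> RunResult:
    """Toy notification FD (NTFD).

    A local watchdog, assumed to observe crashes perfectly, sends one failure
    notification `watchdog_latency` after the crash. The monitor suspects the
    process only when that notification arrives.
    """
    if crash_at is None or crash_at + watchdog_latency > horizon:
        return RunResult(0, 0, None)  # no notification is sent within the horizon
    arrival = crash_at + watchdog_latency + notify_delay
    return RunResult(1, 0, arrival if arrival <= horizon else None)


if __name__ == "__main__":
    delays = [0.1, 0.1, 0.1, 0.9, 0.1]  # every 4th heartbeat in a cycle is late
    print(heartbeat_fd(10.0, period=1.0, timeout=1.5, delays=delays))
    print(notification_fd(10.0, watchdog_latency=0.2, notify_delay=0.1))
    print(heartbeat_fd(10.0, period=1.0, timeout=1.5, delays=delays, crash_at=5.0))
    print(notification_fd(10.0, watchdog_latency=0.2, notify_delay=0.1, crash_at=5.0))
```

In the failure-free run above, the heartbeat detector sends 10 messages and raises 2 false suspicions because two heartbeats arrive late. The notification detector sends nothing. The timeout in the heartbeat sketch is a single knob that trades detection speed against false suspicions. In the notification sketch, accuracy instead rests entirely on the assumed reliability of the local watchdog.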