On the Quality of Service of Failure Detectors

Authors:
Wei Chen;Sam Toueg;Marcos Kawazoe Aguilera
Affiliations:
Oracle Corp., Nashua, NH;Univ. of Toronto, Ont., Canada;Compaq Systems Research Center, Palo Alto, CA
Venue:
IEEE Transactions on Computers
Year:
2002


Abstract

Editor's Note: This paper unfortunately contains some errors, which led to the paper being reprinted in the May 2002 issue. Please see IEEE Transactions on Computers, vol. 51, no. 5, May 2002, pp. 561-580 for the correct paper (available without subscription).

We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies 1) how fast the failure detector detects actual failures and 2) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyze its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements, and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we suggest some ways to make our failure detector adaptive to changes in the probabilistic behavior of the network.
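To make the two QoS questions concrete (how fast a crash is detected, and how often a live process is wrongly suspected), the following is a minimal sketch of a plain timeout-based heartbeat detector under a probabilistic network. It is not the algorithm proposed in the paper. The function name `simulate`, its parameters, the exponential delay model, and the exact way the two quantities are counted are all assumptions made for illustration only.

```python
import random


def simulate(eta, timeout, crash_time, loss_prob, mean_delay, seed=0):
    """Simulate a simple timeout-based heartbeat detector (illustrative only).

    Assumed model: process p sends a heartbeat every `eta` time units until it
    crashes at `crash_time`. Each heartbeat is lost with probability
    `loss_prob`; otherwise it arrives after an exponentially distributed delay
    with mean `mean_delay`. The monitor q suspects p whenever `timeout` elapses
    without a heartbeat arriving.

    Returns (detection_time, false_suspicions).
    """
    rng = random.Random(seed)

    # Arrival times at q of the heartbeats that p sent before crashing.
    arrivals = []
    k = 1
    while k * eta < crash_time:
        if rng.random() >= loss_prob:  # heartbeat was not lost
            arrivals.append(k * eta + rng.expovariate(1.0 / mean_delay))
        k += 1
    arrivals.sort()

    # Replay q's timer: reset on every arrival, expires after `timeout`.
    suspicion_starts = []
    last = 0.0  # q starts its timer at time 0
    for arrival in arrivals:
        if arrival - last > timeout:
            suspicion_starts.append(last + timeout)
        last = arrival
    suspicion_starts.append(last + timeout)  # final, permanent suspicion

    # Suspicion episodes that begin while p is still alive are false.
    false_suspicions = sum(1 for t in suspicion_starts if t < crash_time)
    # Time from the crash until q suspects p permanently.
    detection_time = max(0.0, suspicion_starts[-1] - crash_time)
    return detection_time, false_suspicions


if __name__ == "__main__":
    for timeout in (1.5, 3.0, 6.0):
        print(timeout, simulate(1.0, timeout, 1000.0, 0.05, 0.2, seed=42))
```

In this sketch, a shorter timeout can only shorten the detection time and can only increase the number of false suspicions for a given run, which is the kind of trade-off a QoS specification is meant to quantify.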