On the Quality of Service of Failure Detectors

Authors:
Wei Chen; Sam Toueg; Marcos Kawazoe Aguilera
Affiliations:
Oracle Corp., Nashua, NH; Univ. of Toronto, Ontario, Canada; Compaq Systems Research Center, Palo Alto, CA
Venue:
IEEE Transactions on Computers
Year:
2002

Abstract

We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies 1) how fast the failure detector detects actual failures and 2) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyze its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we suggest some ways to make our failure detector adaptive to changes in the probabilistic behavior of the network.
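
The two quantities named in the abstract, how fast failures are detected and how well false detections are avoided, can be illustrated with a small simulation. The sketch below is not the algorithm proposed in the paper; it models a plain timeout-based heartbeat detector of the commonly used kind. All names (`simulate_detector`, `eta`, `timeout`, `loss_prob`, `mean_delay`, `crash_time`) are hypothetical, and the exponential delay distribution and independent message loss are assumptions chosen only to give the system a probabilistic behavior.

```python
import random


def simulate_detector(eta, timeout, loss_prob, mean_delay, crash_time, seed=0):
    """Simulate a simple timeout-based heartbeat detector (illustrative only).

    Process p sends a heartbeat every `eta` time units until it crashes at
    `crash_time`. Each heartbeat is lost with probability `loss_prob`;
    otherwise it arrives after an exponentially distributed delay with mean
    `mean_delay` (an assumed distribution). The monitor q restarts a timer of
    length `timeout` on every arrival and suspects p when the timer expires.
    """
    rng = random.Random(seed)
    # Assumption: q starts out trusting p, as if a heartbeat arrived at time 0.
    arrivals = [0.0]
    i = 1
    while i * eta <= crash_time:
        if rng.random() >= loss_prob:  # heartbeat i is not lost
            delay = rng.expovariate(1.0 / mean_delay) if mean_delay > 0 else 0.0
            arrivals.append(i * eta + delay)
        i += 1
    arrivals.sort()

    # A suspicion starts whenever the timer set by an arrival expires before
    # the next arrival; the timer set by the last arrival always expires.
    suspicion_starts = []
    for k, arrival in enumerate(arrivals):
        expiry = arrival + timeout
        is_last = k == len(arrivals) - 1
        if is_last or arrivals[k + 1] > expiry:
            suspicion_starts.append(expiry)

    # Suspicions that begin while p is still alive are false detections.
    mistakes = sum(1 for s in suspicion_starts if s < crash_time)
    # The final suspicion is permanent; its lag after the crash is the
    # detection time (zero if q already suspected p when it crashed).
    detection_time = max(0.0, suspicion_starts[-1] - crash_time)
    return {
        "mistakes": mistakes,
        "mistake_rate": mistakes / crash_time,
        "detection_time": detection_time,
    }


if __name__ == "__main__":
    for timeout in (1.5, 3.0, 6.0):
        result = simulate_detector(
            eta=1.0, timeout=timeout, loss_prob=0.05,
            mean_delay=0.2, crash_time=10_000.5, seed=42,
        )
        print(timeout, result)
```

For a fixed seed, raising `timeout` in this model can only lower the number of false suspicions and can only lengthen the detection time, which is the trade-off that a QoS specification of the kind described above has to balance.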