On the quality of service of failure detectors
We study the quality of service (QoS) of failure detectors. By QoS, we mean a specification that quantifies 1) how fast the failure detector detects actual failures and 2) how well it avoids false detections. We first propose a set of QoS metrics to specify failure detectors for systems with probabilistic behaviors, i.e., for systems where message delays and message losses follow some probability distributions. We then give a new failure detector algorithm and analyze its QoS in terms of the proposed metrics. We show that, among a large class of failure detectors, the new algorithm is optimal with respect to some of these QoS metrics. Given a set of failure detector QoS requirements, we show how to compute the parameters of our algorithm so that it satisfies these requirements and we show how this can be done even if the probabilistic behavior of the system is not known. We then present some simulation results that show that the new failure detector algorithm provides a better QoS than an algorithm that is commonly used in practice. Finally, we suggest some ways to make our failure detector adaptive to changes in the probabilistic behavior of the network.
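The two QoS questions above (how fast real failures are detected, and how well false detections are avoided) can be made concrete with a small simulation. The sketch below is illustrative only: the abstract does not describe the algorithm, so the detector here is a generic heartbeat scheme with fixed deadlines, and the names `eta` (heartbeat period), `delta` (deadline shift), `p_loss`, `mean_delay`, the exponential delay model, and the assumption of synchronized clocks are all hypothetical choices made for the example.

```python
import random

INF = float("inf")


def simulate_detector(eta, delta, p_loss, mean_delay, crash_time, seed=0):
    """Simulate one run of a deadline-based heartbeat detector (a sketch).

    Process p sends heartbeat i at time i * eta until it crashes.
    Monitor q uses deadlines tau_i = i * eta + delta: from tau_i until
    tau_{i+1}, q trusts p only once some heartbeat j >= i has arrived.

    Returns (mistakes, detection_time), where mistakes is a list of
    finite (start, end) suspicion intervals and detection_time is the
    delay from the crash to the start of permanent suspicion.
    """
    rng = random.Random(seed)
    n_sent = int(crash_time // eta)  # heartbeats 1..n_sent leave before the crash

    # Each heartbeat is lost with probability p_loss; otherwise it arrives
    # after an exponentially distributed delay (an assumed delay model).
    arrival = {}
    for i in range(1, n_sent + 1):
        if rng.random() >= p_loss:
            arrival[i] = i * eta + rng.expovariate(1.0 / mean_delay)

    # earliest[i] = earliest arrival time of any heartbeat j >= i.
    earliest = {n_sent + 1: INF}
    for i in range(n_sent, 0, -1):
        earliest[i] = min(arrival.get(i, INF), earliest[i + 1])

    suspicions = []
    for i in range(1, n_sent + 2):
        tau = i * eta + delta
        next_tau = (i + 1) * eta + delta
        fresh_at = earliest[i]
        if fresh_at <= tau:
            continue  # a fresh heartbeat is already in hand: trust all interval
        end = INF if fresh_at == INF else min(fresh_at, next_tau)
        if suspicions and suspicions[-1][1] == tau:
            # Continue the suspicion that ran up to this deadline.
            suspicions[-1] = (suspicions[-1][0], end)
        else:
            suspicions.append((tau, end))
        if end == INF:
            break  # no fresh heartbeat will ever arrive: permanent suspicion

    permanent_start = suspicions[-1][0]
    # Simplification: every finite suspicion interval is counted as a mistake,
    # including rare ones caused by heartbeats still in flight at the crash.
    mistakes = suspicions[:-1]
    detection_time = max(0.0, permanent_start - crash_time)
    return mistakes, detection_time


def summarize(mistakes, crash_time):
    """Return (mistakes per time unit, average mistake duration)."""
    if not mistakes:
        return 0.0, 0.0
    total = sum(end - start for start, end in mistakes)
    return len(mistakes) / crash_time, total / len(mistakes)


if __name__ == "__main__":
    m, td = simulate_detector(1.0, 0.5, 0.05, 0.1, 1000.5, seed=1)
    print("detection time:", td)
    print("mistake rate, mean mistake duration:", summarize(m, 1000.5))
```

In this sketch the detection time can never exceed `eta + delta`, because the first deadline after the last heartbeat sent is at most that far from the crash. Increasing `delta` trades slower detection for fewer false suspicions, which is the kind of tension a QoS specification has to resolve.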