Failure detectors are fundamental building blocks in distributed systems. Multi-node failure detectors, in which a single detector monitors N other nodes, play a critical role in overlay networks and peer-to-peer systems. In such networks, failures need to be detected quickly and with low overhead. Achieving both properties at once poses a difficult tradeoff between detection latency and resource consumption. In this paper, we examine this central tradeoff, formalize it as an optimization problem, and analytically derive optimal closed-form formulas for multi-node failure detectors. We provide two variants of the optimal solution, each for an optimality metric suited to a different deployment scenario. √s-LM is a latency-minimizing optimal failure detector that achieves the lowest average failure detection latency under a fixed bandwidth budget for system maintenance. √s-BM is a bandwidth-minimizing optimal failure detector that meets a desired detection latency target while consuming the least bandwidth. We evaluate our optimal results with node lifetimes drawn from bimodal and Pareto distributions, as well as real-world trace data from PlanetLab hosts, web sites, and Microsoft PCs. Compared to standard failure detectors in wide use, √s failure detectors reduce failure detection latency by 40% on average for the same bandwidth consumption or, conversely, reduce bandwidth consumption by 30% for the same failure detection latency.
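The abstract does not state the closed-form formulas themselves. As an illustration only, the sketch below works through a simplified toy version of the latency-versus-bandwidth tradeoff; it is not the paper's derivation and should not be read as √s-LM. It assumes each monitored node i has a known failure rate `r_i`, is probed every `t_i` seconds at a fixed cost per probe, and that a failure is noticed on average `t_i / 2` after it occurs. Minimizing the failure-weighted average latency under a fixed probe budget with a Lagrange multiplier then gives intervals proportional to `1 / sqrt(r_i)`. All function names and parameters here are hypothetical.

```python
import math


def sqrt_intervals(failure_rates, budget, probe_cost=1.0):
    """Probe intervals for the toy model (an assumption, not the paper's formula).

    Minimizes sum(r_i * t_i / 2) subject to sum(probe_cost / t_i) == budget,
    which yields t_i = probe_cost * sum_j sqrt(r_j) / (budget * sqrt(r_i)).
    """
    root_sum = sum(math.sqrt(r) for r in failure_rates)
    return [probe_cost * root_sum / (budget * math.sqrt(r)) for r in failure_rates]


def uniform_intervals(failure_rates, budget, probe_cost=1.0):
    """Baseline: every node is probed at the same interval within the budget."""
    n = len(failure_rates)
    return [probe_cost * n / budget] * n


def bandwidth(intervals, probe_cost=1.0):
    """Total probe cost per second for a set of probe intervals."""
    return sum(probe_cost / t for t in intervals)


def expected_latency(failure_rates, intervals):
    """Failure-weighted mean detection latency, assuming t_i / 2 per failure."""
    total_rate = sum(failure_rates)
    weighted = sum(r * t / 2 for r, t in zip(failure_rates, intervals))
    return weighted / total_rate


if __name__ == "__main__":
    rates = [0.01, 0.04, 0.25]   # hypothetical per-node failure rates
    budget = 3.0                 # hypothetical probe budget per second
    for name, fn in (("uniform", uniform_intervals), ("sqrt", sqrt_intervals)):
        ivs = fn(rates, budget)
        print(name, [round(t, 3) for t in ivs],
              round(bandwidth(ivs), 3), round(expected_latency(rates, ivs), 4))
```

Both schedules spend the same budget, but in this toy model the square-root schedule probes failure-prone nodes more often and stable nodes less often, so its weighted latency is never worse than the uniform baseline (by the Cauchy-Schwarz inequality). A bandwidth-minimizing variant of the same model would fix a latency target and solve for the smallest budget instead.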