Latency and bandwidth-minimizing failure detectors

Authors:
Kelvin C. W. So;Emin Gün Sirer
Affiliations:
Cornell University, Ithaca, NY;Cornell University, Ithaca, NY
Venue:
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Year:
2007

Abstract

Failure detectors are fundamental building blocks in distributed systems. Multi-node failure detectors, in which a single detector monitors N other nodes, play a critical role in overlay networks and peer-to-peer systems. In such networks, failures need to be detected quickly and with low overhead, and achieving both at once poses a difficult tradeoff between detection latency and resource consumption. In this paper, we examine this central tradeoff, formalize it as an optimization problem, and analytically derive optimal closed-form formulas for multi-node failure detectors. We provide two variants of the optimal solution, each with an optimality metric suited to a different deployment scenario. √s-LM is a latency-minimizing optimal failure detector that achieves the lowest average failure detection latency under a fixed bandwidth constraint for system maintenance. √s-BM is a bandwidth-minimizing optimal failure detector that meets a desired detection latency target while consuming the least bandwidth. We evaluate our optimal results with node lifetimes drawn from bimodal and Pareto distributions, as well as with real-world trace data from PlanetLab hosts, web sites, and Microsoft PCs. Compared to standard failure detectors in wide use, √s failure detectors reduce failure detection latencies by 40% on average for the same bandwidth consumption, or conversely, reduce the amount of bandwidth consumed by 30% for the same failure detection latency.
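
The abstract does not state the closed-form formulas, so the following is only an illustrative sketch of the kind of tradeoff it describes, under a simplified model that is an assumption here and not the paper's derivation: each monitored node i fails at a known rate lambda_i, is pinged at a rate f_i, and a failure is detected on average half a ping interval, 1/(2 f_i), after it occurs. Under that model, minimizing the failure-weighted mean latency subject to a total ping budget gives, by a standard Lagrange-multiplier argument, ping rates proportional to sqrt(lambda_i). The function names, the failure-rate inputs, and the budget and target parameters are all hypothetical.

```python
import math


def lm_ping_rates(fail_rates, budget):
    """Latency-minimizing split of a total ping budget (pings per second).

    Assumed toy model, not the paper's formula: ping rate f_i is set
    proportional to sqrt(lambda_i). All failure rates must be positive.
    """
    roots = [math.sqrt(lam) for lam in fail_rates]
    total_root = sum(roots)
    return [budget * r / total_root for r in roots]


def mean_detection_latency(fail_rates, ping_rates):
    """Failure-weighted mean latency, assuming detection takes 1/(2*f_i)."""
    total_fail = sum(fail_rates)
    return sum(
        (lam / total_fail) * 0.5 / f
        for lam, f in zip(fail_rates, ping_rates)
    )


def bm_ping_rates(fail_rates, target_latency):
    """Bandwidth-minimizing rates that meet a mean-latency target.

    Under the same toy model, the smallest sufficient budget is
    (sum of sqrt(lambda_i))**2 / (2 * target * sum of lambda_i),
    split with the same square-root rule.
    """
    total_root = sum(math.sqrt(lam) for lam in fail_rates)
    budget = total_root ** 2 / (2.0 * target_latency * sum(fail_rates))
    return lm_ping_rates(fail_rates, budget)


if __name__ == "__main__":
    # One short-lived node among three long-lived ones (made-up rates).
    rates = [1e-3, 1e-5, 1e-5, 1e-5]
    budget = 4.0
    uniform = [budget / len(rates)] * len(rates)
    weighted = lm_ping_rates(rates, budget)
    print("uniform latency :", mean_detection_latency(rates, uniform))
    print("weighted latency:", mean_detection_latency(rates, weighted))
```

In this sketch the two variants are duals of each other: feeding the latency achieved by lm_ping_rates back into bm_ping_rates as the target recovers the same budget. Whether the paper's √s-LM and √s-BM detectors take exactly this form cannot be confirmed from the abstract alone.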