On scalable and efficient distributed failure detectors

Authors:
Indranil Gupta;Tushar D. Chandra;Germán S. Goldszmidt
Affiliations:
Cornell Univ., Ithaca, NY; IBM T.J. Watson Research Center, Yorktown Heights, NY; IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
Year:
2001



Abstract

Process groups in distributed applications and services rely on failure detectors to detect process failures completely, and as quickly, accurately, and scalably as possible, even in the face of unreliable message deliveries. In this paper, we quantify the optimal scalability of distributed, complete failure detectors, measured as network load (messages per second, with a bound on message size), as a function of application-specified requirements. These requirements are 1) quick failure detection by some non-faulty process, and 2) accuracy of failure detection. We assume a crash-recovery (non-Byzantine) failure model and a network model that is probabilistically unreliable (with respect to message deliveries and process failures). First, we characterize, under certain independence assumptions, the optimal worst-case network load imposed by any failure detector that achieves an application's requirements. We then discuss why traditional heartbeating schemes are inherently unscalable when measured against this optimal load. We also present a randomized, distributed failure detector algorithm that imposes an equal expected load on each group member. This protocol satisfies the application-defined constraints of completeness and accuracy, and meets the detection-time constraint on average. It imposes a network load that differs from the optimal by a sub-optimality factor much lower than that of traditional distributed heartbeating schemes. Moreover, for large groups, this sub-optimality factor does not vary with group size.
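
The scalability contrast drawn in the abstract can be illustrated with a rough message count. The sketch below is not the paper's analysis. It assumes an all-to-all heartbeating scheme in which every member sends one heartbeat to every other member per period, and a hypothetical randomized probing scheme in which every member probes one random peer per period and, in the worst case, asks k helpers to probe on its behalf. The function names, the parameter k, and the per-probe message counts are assumptions made for illustration.

```python
def heartbeat_load(n: int, period_s: float) -> float:
    """Total messages/second for assumed all-to-all heartbeating:
    each of n members sends one heartbeat to each of the other n - 1."""
    return n * (n - 1) / period_s


def random_probe_load(n: int, period_s: float, k: int = 0) -> float:
    """Worst-case total messages/second for an assumed randomized scheme.

    Each member probes one random peer per period (ping + ack = 2 messages).
    If the probe goes unanswered, it asks k hypothetical helpers to probe
    for it; each helper costs 4 messages (request, ping, ack, relayed ack).
    """
    return n * (2 + 4 * k) / period_s


def per_member(total_load: float, n: int) -> float:
    """Average load attributed to a single group member."""
    return total_load / n


if __name__ == "__main__":
    period = 1.0  # seconds per protocol period (illustrative value)
    for n in (10, 100, 1000):
        hb = per_member(heartbeat_load(n, period), n)
        rp = per_member(random_probe_load(n, period, k=3), n)
        print(f"n={n}: heartbeat {hb:.0f} msg/s per member, "
              f"random probe {rp:.0f} msg/s per member")
```

Under these assumptions the per-member heartbeat load grows linearly with group size, while the per-member probing load depends only on k and the period. This matches the qualitative claim of an equal expected load per member that does not grow with the group, though the abstract's sub-optimality factor is defined against the optimal load, which this count does not model.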