Process groups in distributed applications and services rely on failure detectors to detect process failures completely, and as quickly, accurately, and scalably as possible, even in the face of unreliable message delivery. In this paper, we quantify the optimal scalability, in terms of network load (in messages per second, with messages having a size limit), of distributed, complete failure detectors as a function of application-specified requirements. These requirements are 1) quick failure detection by some non-faulty process, and 2) accuracy of failure detection. We assume a crash-recovery (non-Byzantine) failure model and a network model that is probabilistically unreliable (with respect to message deliveries and process failures). First, we characterize, under certain independence assumptions, the optimal worst-case network load imposed by any failure detector that achieves an application's requirements. We then discuss why traditional heartbeating schemes are inherently unscalable relative to this optimal load. We also present a randomized, distributed failure detector algorithm that imposes an equal expected load per group member. This protocol satisfies the application-defined constraints of completeness and accuracy, and of speed of detection on average. It imposes a network load that differs from the optimal by a sub-optimality factor that is much lower than that for traditional distributed heartbeating schemes. Moreover, this sub-optimality factor does not vary with group size (for large groups).
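To make the scalability contrast concrete, the following is a minimal sketch, not the protocol presented in the paper. It assumes a hypothetical scheme in which each member sends one ping to a uniformly random peer per period, and compares its message count with all-to-all heartbeating. The function names, the one-ping-per-period rule, and the choice to count pings only (no acknowledgements, no message loss) are all illustrative assumptions.

```python
import random


def all_to_all_heartbeat_load(n):
    """Messages per period if every member heartbeats every other member."""
    return n * (n - 1)


def random_probe_round(n, rng):
    """One period of the assumed scheme: each member pings one random peer.

    Returns a list where entry i is the number of pings member i received.
    """
    received = [0] * n
    for sender in range(n):
        # Draw a target uniformly from the n - 1 other members.
        target = rng.randrange(n - 1)
        if target >= sender:
            target += 1  # shift past the sender so it never pings itself
        received[target] += 1
    return received


def mean_received_per_member(n, rounds, seed=0):
    """Average pings received per period by each member over many rounds."""
    rng = random.Random(seed)
    totals = [0] * n
    for _ in range(rounds):
        for member, count in enumerate(random_probe_round(n, rng)):
            totals[member] += count
    return [total / rounds for total in totals]


if __name__ == "__main__":
    n = 100
    print("all-to-all heartbeats per period:", all_to_all_heartbeat_load(n))
    print("random-probe pings per period:", sum(random_probe_round(n, random.Random(0))))
```

Under these assumptions, heartbeating costs n(n - 1) messages per period while the random-probe scheme costs n, and every member sends exactly one ping and receives one in expectation. This illustrates an equal expected load per member; it does not reproduce the paper's optimal-load analysis or its sub-optimality factor.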