A gossip-style failure detection service
Middleware '98 Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing
Failure detection is valuable for system management, replication, load balancing, and other distributed services. To date, failure detection services have scaled poorly with the number of members being monitored. This paper describes a new gossip-based protocol that scales well and provides timely detection. We analyze the protocol and then extend it to discover and leverage the underlying network topology for much better resource utilization. We then combine it with a second, broadcast-based protocol that handles partition failures.
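The abstract does not spell out the mechanics of the gossip protocol, so the following is only a minimal sketch of how a heartbeat-counter gossip detector of this general kind might work. It assumes that each member keeps a table of heartbeat counters, periodically increments its own counter and sends the table to one randomly chosen peer, and suspects any member whose counter has not increased for longer than a timeout. The names `Member`, `gossip_round`, and `t_fail`, and the use of integer round numbers as a clock, are illustrative assumptions rather than the paper's own interface. The topology-aware extension and the broadcast-based partition handling are not modeled.

```python
import random


class Member:
    """One participant in a hypothetical heartbeat-gossip failure detector."""

    def __init__(self, name, t_fail):
        self.name = name
        self.t_fail = t_fail  # assumed timeout, in the same units as `now`
        # name -> (heartbeat counter, local time the counter last increased)
        self.table = {name: (0, 0)}

    def tick(self, now):
        """Increment this member's own heartbeat counter."""
        heartbeat, _ = self.table[self.name]
        self.table[self.name] = (heartbeat + 1, now)

    def merge(self, remote_table, now):
        """Adopt any strictly larger heartbeat counter seen in a gossip message."""
        for name, (heartbeat, _) in remote_table.items():
            local = self.table.get(name)
            if local is None or heartbeat > local[0]:
                self.table[name] = (heartbeat, now)

    def suspects(self, now):
        """Members whose counter has not increased for more than t_fail."""
        return sorted(
            name
            for name, (_, last_increase) in self.table.items()
            if name != self.name and now - last_increase > self.t_fail
        )


def gossip_round(members, now, rng=random):
    """Each member ticks, then sends its table to one random other member."""
    for member in members:
        member.tick(now)
        peer = rng.choice([p for p in members if p is not member])
        peer.merge(member.table, now)
```

In this sketch a member is suspected purely from local information: no member is ever polled directly, which is what lets the per-member message load stay constant as the group grows. Choosing `t_fail` trades detection time against the chance of wrongly suspecting a live member whose heartbeat was slow to propagate.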