Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol detects failures in large distributed systems asynchronously, without the limits associated with reliable multicasting for group communications. The hierarchical protocol leverages the underlying network topology to achieve faster failure detection. In addition to studying the effectiveness and efficiency of these two agreement protocols, we propose a third protocol that extends the hierarchical approach by piggybacking gossip information on application-generated messages. The protocols are simulated and evaluated with a fault-injection model for scalable distributed systems composed of clusters of workstations connected by high-performance networks, such as the CPlant system at Sandia National Laboratories. The model supports permanent and transient node and link failures, with rates specified at simulation time, for processors functioning in a fail-silent fashion. Through high-fidelity, CAD-based modeling and simulation, we demonstrate the strengths and weaknesses of each approach in terms of agreement time, number of gossips, and overall scalability.
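
To make the basic gossip idea concrete, the following is a minimal sketch and not the protocol evaluated in the paper. The abstract gives no message formats or parameters, so the sketch assumes a common heartbeat-counter scheme: each node increments its own counter every round, pushes its table to one random peer, and suspects any member whose counter has not increased for `t_fail` rounds. All names (`GossipNode`, `simulate`, `t_fail`) are hypothetical, failures are permanent and fail-silent, and links are assumed reliable.

```python
import random


class GossipNode:
    """One member of a gossip-based failure detection group (illustrative only)."""

    def __init__(self, node_id, peers, t_fail=5):
        self.node_id = node_id
        self.peers = list(peers)
        self.t_fail = t_fail  # assumed timeout, in gossip rounds
        self.clock = 0        # local round counter
        # member id -> [highest heartbeat seen, local round when it last increased]
        self.table = {m: [0, 0] for m in [node_id, *peers]}

    def tick(self):
        """Advance local time and bump this node's own heartbeat."""
        self.clock += 1
        entry = self.table[self.node_id]
        entry[0] += 1
        entry[1] = self.clock

    def merge(self, remote_table):
        """Adopt any heartbeat that is newer than the local copy."""
        for member, (heartbeat, _) in remote_table.items():
            local = self.table[member]
            if heartbeat > local[0]:
                local[0] = heartbeat
                local[1] = self.clock

    def suspects(self):
        """Members whose heartbeat has not increased for more than t_fail rounds."""
        return sorted(
            m for m, (_, seen) in self.table.items()
            if m != self.node_id and self.clock - seen > self.t_fail
        )


def simulate(n_nodes=8, crashed=frozenset({3}), rounds=30, seed=0):
    """Run push gossip; crashed nodes are fail-silent from the very start."""
    rng = random.Random(seed)
    ids = list(range(n_nodes))
    nodes = {i: GossipNode(i, [j for j in ids if j != i]) for i in ids}
    for _ in range(rounds):
        for i in ids:
            if i in crashed:
                continue  # a fail-silent node sends nothing
            nodes[i].tick()
            target = rng.choice(nodes[i].peers)
            if target not in crashed:
                nodes[target].merge(nodes[i].table)
    # Report each live node's current suspect list.
    return {i: nodes[i].suspects() for i in ids if i not in crashed}


if __name__ == "__main__":
    for node_id, suspected in simulate().items():
        print(node_id, suspected)
```

In this sketch every live node eventually lists the crashed node as suspected, because its heartbeat never advances. A live node can also be suspected temporarily if gossip about it happens to arrive late, which is one reason agreement among nodes and the choice of timeout matter when comparing protocols by agreement time and number of gossips.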