Simulative performance analysis of gossip failure detection for scalable distributed systems

Authors:
Mark W. Burns;Alan D. George;Bradley A. Wallace
Affiliations:
High-performance Computing and Simulation (HCS) Research Laboratory, Department of Electrical and Computer Engineering, University of Florida, P.O. Box 116200, Gainesville, FL 32611-6200, USA (all authors)
Venue:
Cluster Computing
Year:
1999

Abstract

Three protocols for gossip-based failure detection services in large-scale heterogeneous clusters are analyzed and compared. The basic gossip protocol provides a means of detecting failures asynchronously in large distributed systems, without the limits associated with reliable multicasting for group communications. The hierarchical protocol leverages the underlying network topology to achieve faster failure detection. In addition to studying the effectiveness and efficiency of these two agreement protocols, we propose a third protocol that extends the hierarchical approach by piggybacking gossip information on application-generated messages. The protocols are simulated and evaluated with a fault-injection model for scalable distributed systems composed of clusters of workstations connected by high-performance networks, such as the CPlant system at Sandia National Laboratories. The model supports permanent and transient node and link failures, with rates specified at simulation time, for processors functioning in a fail-silent fashion. Through high-fidelity, CAD-based modeling and simulation, we demonstrate the strengths and weaknesses of each approach in terms of agreement time, number of gossips, and overall scalability.
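
As a rough illustration of the basic gossip idea summarized in the abstract, the following Python sketch simulates heartbeat-based gossip failure detection. It is not the authors' simulation model. The heartbeat-counter scheme, the class `GossipNode`, the function `simulate`, and every parameter (`n_nodes`, `rounds`, `fail_node`, `fail_round`, `t_fail`, `seed`) are assumptions made for illustration only. The sketch covers a single permanent fail-silent node failure and omits link failures, transient faults, the hierarchical protocol, and piggybacking.

```python
import random


class GossipNode:
    """One node in a hypothetical heartbeat-counter gossip scheme."""

    def __init__(self, node_id, all_ids, t_fail):
        self.node_id = node_id
        self.t_fail = t_fail  # rounds without news before a node is suspected
        self.alive = True
        # Highest heartbeat counter seen so far for each node.
        self.heartbeat = {j: 0 for j in all_ids}
        # Local round at which each node's heartbeat last increased.
        self.last_update = {j: 0 for j in all_ids}

    def tick(self, now):
        """Advance this node's own heartbeat once per gossip round."""
        self.heartbeat[self.node_id] += 1
        self.last_update[self.node_id] = now

    def merge(self, remote_heartbeat, now):
        """Keep the larger heartbeat for every node named in a received gossip."""
        for j, hb in remote_heartbeat.items():
            if hb > self.heartbeat[j]:
                self.heartbeat[j] = hb
                self.last_update[j] = now

    def suspects(self, now):
        """Nodes whose heartbeat has not increased for more than t_fail rounds."""
        return {
            j
            for j, t in self.last_update.items()
            if j != self.node_id and now - t > self.t_fail
        }


def simulate(n_nodes=8, rounds=200, fail_node=3, fail_round=50, t_fail=40, seed=0):
    """Run push gossip with one permanent fail-silent failure (assumed setup)."""
    rng = random.Random(seed)
    ids = list(range(n_nodes))
    nodes = [GossipNode(i, ids, t_fail) for i in ids]
    for now in range(1, rounds + 1):
        if now == fail_round:
            nodes[fail_node].alive = False  # fail-silent: stops sending and receiving
        for node in nodes:
            if not node.alive:
                continue
            node.tick(now)
            # Each live node pushes its heartbeat list to one random other node.
            peer = nodes[rng.choice([j for j in ids if j != node.node_id])]
            if peer.alive:  # gossip sent to a failed node is simply lost
                peer.merge(dict(node.heartbeat), now)
    # Suspect set held by each surviving node at the end of the run.
    return {n.node_id: n.suspects(rounds) for n in nodes if n.alive}


if __name__ == "__main__":
    for node_id, suspected in sorted(simulate().items()):
        print(node_id, sorted(suspected))
```

With these default values, every surviving node is expected to end the run suspecting only the failed node. Lowering `t_fail` shortens detection time in this toy model but raises the chance that a live node is falsely suspected, a trade-off that is a property of this sketch and not a result taken from the paper.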