Gossip-Style Failure Detection and Distributed Consensus for Scalable Heterogeneous Clusters

  • Authors:
  • Sridharan Ranganathan; Alan D. George; Robert W. Todd; Matthew C. Chidester

  • Affiliations:
  • High-performance Computing and Simulation (HCS) Research Laboratory, Department of Electrical and Computer Engineering, University of Florida, PO Box 116200, Gainesville, FL 32611-6200, USA

  • Venue:
  • Cluster Computing
  • Year:
  • 2001

Abstract

Gossip protocols provide a means by which failures can be detected in large, distributed systems in an asynchronous manner without the limits associated with reliable multicasting for group communications. However, in order to be effective with application recovery and reconfiguration, these protocols require mechanisms by which failures can be detected with system-wide consensus in a scalable fashion. This paper presents three new gossip-style protocols supported by a novel algorithm to achieve consensus in scalable, heterogeneous clusters. The round-robin protocol improves on basic randomized gossiping by distributing gossip messages in a deterministic order that optimizes bandwidth consumption. Redundant gossiping is completely eliminated in the binary round-robin protocol, and the round-robin with sequence check protocol is a useful extension that yields efficient detection times without the need for system-specific optimization. The distributed consensus algorithm works with these gossip protocols to achieve agreement among the operable nodes in the cluster on the state of the system featuring either a flat or a layered design. The various protocols are simulated and evaluated in terms of consensus time and scalability using a high-fidelity, fault-injection model for distributed systems comprised of clusters of workstations connected by high-performance networks.
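
The abstract does not spell out the deterministic ordering, so the following is only a minimal sketch under stated assumptions. It assumes that in round r a node with ID i gossips to node (i + r) mod n in the round-robin protocol, and to node (i + 2^(r-1)) mod n in the binary round-robin protocol. The function names and the set-based view of gossip state are hypothetical and chosen for illustration. The sketch counts how many rounds pass before every node holds information about every other node.

```python
import math


def round_robin_dest(src, rnd, n):
    """Assumed round-robin rule: in round rnd, node src gossips to (src + rnd) mod n."""
    return (src + rnd) % n


def binary_round_robin_dest(src, rnd, n):
    """Assumed binary round-robin rule: the offset doubles each round."""
    return (src + 2 ** (rnd - 1)) % n


def rounds_to_full_knowledge(n, dest_fn, max_rounds):
    """Return the first round after which every node knows about all n nodes.

    known[i] is the set of node IDs whose state node i has heard about.
    Returns None if full knowledge is not reached within max_rounds.
    """
    known = [{i} for i in range(n)]
    for rnd in range(1, max_rounds + 1):
        # All messages in a round are sent at once, so read from a snapshot.
        snapshot = [set(k) for k in known]
        for src in range(n):
            known[dest_fn(src, rnd, n)] |= snapshot[src]
        if all(len(k) == n for k in known):
            return rnd
    return None


if __name__ == "__main__":
    n = 8
    binary_limit = math.ceil(math.log2(n))
    print("round-robin:", rounds_to_full_knowledge(n, round_robin_dest, n - 1))
    print("binary round-robin:",
          rounds_to_full_knowledge(n, binary_round_robin_dest, binary_limit))
```

Under these assumptions, each message in the binary variant carries information the receiver has not yet seen, so an 8-node system reaches full knowledge in 3 rounds. The sketch ignores failures, message loss, and the consensus step, all of which the paper's fault-injection simulations address.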