On scalable and efficient distributed failure detectors

  • Authors:
  • Indranil Gupta;Tushar D. Chandra;Germán S. Goldszmidt

  • Affiliations:
  • Cornell Univ., Ithaca, NY;IBM T.J. watson Research Center, Yorktown Heights, NY;IBM T.J. watson Research Center, Yorktown Heights, NY0

  • Venue:
  • Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
  • Year:
  • 2001

Quantified Score

Hi-index 0.00

Visualization

Abstract

Process groups in distributed applications and services rely on failure detectors to detect process failures completely, and as quickly, accurately, and scalably as possible, even in the face of unreliable message deliveries. In this paper, we look at quantifying the optimal scalability, in terms of network load, (in messages per second, with messages having a size limit) of distributed, complete failure detectors as a function of application-specified requirements. These requirements are 1) quick failure detection by some non-faulty process, and 2) accuracy of failure detection. We assume a crash-recovery (non-Byzantine) failure model, and a network model that is probabilistically unreliable (w.r.t. message deliveries and process failures). First, we characterize, under certain independence assumptions, the optimum worst-case network load imposed by any failure detector that achieves an application's requirements. We then discuss why traditional heart beating schemes are inherently unscalable according to the optimal load. We also present a randomized, distributed, failure detector algorithm that imposes an equal expected load per group member. This protocol satisfies the application defined constraints of completeness and accuracy, and speed of detection on an average. It imposes a network load that differs frown the optimal by a sub-optimality factor that is much lower than that for traditional distributed heartbeating schemes. Moreover, this sub-optimality factor does not vary with group size (for large groups).