Crash-quiescent failure detection

  • Authors:
  • Srikanth Sastry;Scott M. Pike;Jennifer L. Welch

  • Affiliations:
  • Department of Computer Science and Engineering, Texas A&M University, College Station, TX;Department of Computer Science and Engineering, Texas A&M University, College Station, TX;Department of Computer Science and Engineering, Texas A&M University, College Station, TX

  • Venue:
  • DISC'09 Proceedings of the 23rd international conference on Distributed computing
  • Year:
  • 2009

Quantified Score

Hi-index 0.00

Visualization

Abstract

A distributed algorithm is crash quiescent if it eventually stops sending messages to crashed processes. An algorithm can be made crash quiescent by providing it with either a crash notification service or a reliable communication service. Both services can be implemented in practical environments with failure detectors. Therefore, crash-quiescent failure detection is fundamental to system-wide crash quiescence. We establish necessary and sufficient conditions for crash-quiescent failure detection in partially synchronous environments where a bounded, but unknown, number of consecutive messages can be arbitrarily late or lost. Without a correct majority of processes, not even the weakest oracle for fault-tolerant consensus, ⋄W, can be implemented crash quiescently. With a correct majority, however, the eventually perfect failure detector, ⋄P, is possible. Our ⋄P algorithm is correct in all runs, but improves performance via crash quiescence in any run with a correct majority. We also present a refinement of our ⋄P algorithm to mitigate the overhead of achieving crash quiescence; the resulting bit complexity per utilized link is asymptotically better than or equal to that of non-crash-quiescent counterparts.