A short introduction to failure detectors for asynchronous distributed systems

  • Authors:
  • Michel Reynal

  • Affiliations:
  • IRISA, Rennes Cedex, France

  • Venue:
  • ACM SIGACT News
  • Year:
  • 2005

Abstract

Since the first version of Chandra and Toueg's seminal paper "Unreliable failure detectors for reliable distributed systems" appeared in 1991, the failure detector concept has been studied extensively. This is not surprising, as failure detection is pervasive in the design, analysis, and implementation of many fault-tolerant distributed algorithms that form the core of distributed system middleware. The literature on this topic is mostly technical and appears mainly in theoretically inclined journals and conferences. The aim of this paper is to offer an introductory survey of the failure detector concept for readers who are not familiar with it and want to quickly understand its aim, its basic principles, its power, and its limitations. To this end, the paper first describes the motivations that underlie the concept, and then surveys several distributed computing problems, showing how each can be solved with the help of an appropriate failure detector. This short paper thus presents motivations, concepts, problems, definitions, and algorithms; it does not contain proofs. It is aimed at people who want to understand the basics of failure detectors.