Fundamentals of fault-tolerant distributed computing in asynchronous environments

  • Authors: Felix C. Gärtner
  • Affiliations: Darmstadt Univ. of Technology, Darmstadt, Germany
  • Venue: ACM Computing Surveys (CSUR)
  • Year: 1999

Abstract

Fault tolerance in distributed computing is a wide area with a significant body of literature that is vastly diverse in methodology and terminology. This paper aims at structuring the area and thus guiding readers into this interesting field. We use a formal approach to define important terms like fault, fault tolerance, and redundancy. This leads to four distinct forms of fault tolerance and to two main phases in achieving them: detection and correction. We show that this can help to reveal inherently fundamental structures that contribute to understanding and unifying methods and terminology. By doing this, we survey many existing methodologies and discuss their relations. The underlying system model is the close-to-reality asynchronous message-passing model of distributed computing.