Timely Failure Detection in a Large Distributed Real-Time System

  • Authors:
  • Tony P. Ng;Vikram N. Patel

  • Venue:
  • WORDS '94 Proceedings of the 1st Workshop on Object-Oriented Real-Time Dependable Systems
  • Year:
  • 1994

Abstract

This paper describes the experience of designing and implementing failure detection and reporting in a large distributed real-time system used for air traffic control (ATC). We believe that systematic analysis is needed to guide the design of failure detection and to track the large number of failures it must handle. Analyses such as determining how quickly failures must be detected should be performed carefully to avoid later redesign. A comprehensive analysis also provides a basis for subsequently testing the design, when fault injection and extended testing are needed to evaluate and debug it. Failure detectors should detect specific failures so that appropriate reports and recovery actions can be initiated after detection.