On switching policies for modular redundancy fault-tolerant computing systems

  • Authors:
  • M. Berg;I. Koren

  • Affiliations:
  • -;-

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 1987

Quantified Score

Hi-index 15.00

Visualization

Abstract

The objective of fault-tolerant computing systems is to provide an error-free operation in the presence of faults. The system has to recover from the effects of a fault by employing certain recovery procedures like program rollback, reload, and restart, etc. However, these recovery procedures, result in interruptions in the system's operation, thus reducing the availability of the system for user applications. Fault-tolerant systems for critical applications include, therefore, standby spares that are ready to replace active modules which fail to recover from the effects of a fault. A standby spare may also be used to replace a module suffering from frequent fault occurrences resulting in too many repetitions of the recovery process, in order to increase the availability of the system for user applications. In this case a module switching policy is needed indicating upon a fault occurrence, whether to retry a failing module or switch it out and replace it by a spare, considering the remaining mission time and the probability of a system crash. A module switching policy for dynamic redundancy systems is presented in this paper and the improvement in application-oriented availability due to the use of this policy is illustrated.