On the availability of a distributed computer system with failing components

  • Authors:
  • Erol Gelenbe; David Finkel; Satish K. Tripathi

  • Affiliations:
  • ISEM, Université de Paris-Sud, 91405 Orsay, France; Department of Mathematics, Bucknell University, Lewisburg, PA; Systems Design and Analysis Group, Department of Computer Science, University of Maryland, College Park, MD

  • Venue:
  • SIGMETRICS '85 Proceedings of the 1985 ACM SIGMETRICS conference on Measurement and modeling of computer systems
  • Year:
  • 1985

Abstract

We present a model for distributed systems with failing components. Each node may fail, and during its recovery the load is distributed to other nodes that are operational. The model assumes periodic checkpointing for error recovery and testing of the status of other nodes for the distribution of load. We consider the availability of a node, which is the proportion of time a node is available for processing, as the performance measure. A methodology for optimizing the availability of a node with respect to the checkpointing and testing intervals is given. A decomposition approach that uses the steady-state flow balance condition to estimate the load at a node is proposed. Numerical examples are presented to demonstrate the usefulness of the technique. For the case in which all nodes are identical, closed-form solutions are obtained.