Load Redistribution Under Failure in Distributed Systems

  • Authors:
  • T. C. K. Chou; J. A. Abraham

  • Affiliations:
  • Tandem Computers;-

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 1983

Abstract

To implement a distributed system with fail-soft capabilities, it is necessary to specify algorithms that redistribute the workload of a failed processor to the remaining good processors. This paper develops a general model to analyze the behavior of these algorithms in a distributed system. Such algorithms should be used with caution, as they are capable of making the entire system unstable. By unstable we mean that if a processor fails and its workload is redistributed, the increased workload directed toward the rest of the system could drive one or more processors into overload, resulting in a serious degradation of system performance. Using the general model, we have studied a class of load redistribution algorithms that use various techniques to redistribute workload. These techniques include buffering jobs arriving at the failed processor, transmitting only the jobs in the queue of the failed processor, and rerouting all jobs around the failed processor. For this class of algorithms we have derived closed-form expressions for the performance of the system as a function of job arrival rate, job service rate, processor failure rate, and processor service rate. In addition, we have defined a criterion which, if adhered to, will guarantee system stability in the event of failure.
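
The notion of instability described in the abstract can be illustrated with a small sketch. This is not the paper's model or its stability criterion. It assumes, purely for illustration, that each processor behaves like an independent single-server queue, that the failed processor's arrivals are rerouted evenly across the survivors, and that a processor is overloaded when its utilization (arrival rate divided by service rate) reaches 1. The function names and the even-split policy are hypothetical.

```python
def utilizations_after_failure(arrival_rates, service_rates, failed):
    """Per-processor utilization after one processor fails.

    Assumption (not from the paper): the failed processor's job arrivals
    are rerouted evenly across all surviving processors.
    """
    survivors = [i for i in range(len(arrival_rates)) if i != failed]
    if not survivors:
        raise ValueError("no surviving processors to take the load")
    # Extra arrival rate each survivor absorbs under the even-split policy.
    extra = arrival_rates[failed] / len(survivors)
    return {
        i: (arrival_rates[i] + extra) / service_rates[i]
        for i in survivors
    }


def is_stable(utilizations):
    """Assumed criterion: every surviving processor stays below utilization 1."""
    return all(rho < 1.0 for rho in utilizations.values())


if __name__ == "__main__":
    # Three identical processors, each serving 10 jobs per unit time.
    light = utilizations_after_failure([6, 6, 6], [10, 10, 10], failed=0)
    heavy = utilizations_after_failure([8, 8, 8], [10, 10, 10], failed=0)
    print("light load stable:", is_stable(light))
    print("heavy load stable:", is_stable(heavy))
```

Under these assumptions, a system that runs comfortably before a failure can still be pushed into overload afterward: with per-processor arrival rate 8 and service rate 10, losing one of three processors raises each survivor's offered load to 12, which exceeds its capacity.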