Load Redistribution Under Failure in Distributed Systems

Authors:
T. C. K. Chou; J. A. Abraham
Affiliations:
Tandem Computers;-
Venue:
IEEE Transactions on Computers
Year:
1983

Abstract

To implement a distributed system with fail-soft capabilities, it is necessary to specify algorithms that redistribute the workload of a failed processor to the remaining good processors. This paper develops a general model to analyze the behavior of these algorithms in a distributed system. Such algorithms should be used with caution, as they are capable of making the entire system unstable. By unstable we mean that if a processor fails and its workload is redistributed, the increased workload directed toward the rest of the system could drive one or more of the processors into overload, resulting in a serious degradation of system performance. Using the general model, we have studied a class of load redistribution algorithms that use various techniques to redistribute workload. These techniques include buffering jobs arriving at the failed processor, transmitting only the jobs in the queue of the failed processor, and rerouting all jobs around the failed processor. For this class of algorithms we have derived closed-form expressions for the performance of the system as a function of job arrival rate, job service rate, processor failure rate, and processor service rate. In addition, we have defined a criterion which, if adhered to, will guarantee system stability in the event of failure.
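
The instability described in the abstract can be illustrated with a minimal sketch. It is not the paper's model or its stability criterion. It assumes that all jobs bound for the failed processor are rerouted and split evenly among the survivors, and that a survivor is overloaded when its utilization (arrival rate divided by service rate) reaches 1, the standard condition for a single-server queue. The function names and the numeric rates are hypothetical.

```python
def redistribute_evenly(arrival_rates, failed):
    """Reroute the failed processor's arrivals evenly across the survivors.

    Assumption: an even split; real redistribution algorithms may weight
    the split by capacity or current queue length.
    """
    survivors = [i for i in range(len(arrival_rates)) if i != failed]
    if not survivors:
        raise ValueError("no surviving processors")
    share = arrival_rates[failed] / len(survivors)
    return {i: arrival_rates[i] + share for i in survivors}


def overloaded(new_rates, service_rates):
    """Return the survivors whose utilization (lambda / mu) is at least 1.

    Assumption: the single-server queue stability condition, used here
    as a stand-in for the paper's own criterion.
    """
    return [i for i, lam in new_rates.items() if lam / service_rates[i] >= 1.0]


# Hypothetical rates in jobs per second.
arrival = [4.0, 6.0, 8.0]
service = [10.0, 10.0, 10.0]

after = redistribute_evenly(arrival, failed=0)
print(after)                       # {1: 8.0, 2: 10.0}
print(overloaded(after, service))  # [2]
```

In this example every processor is stable before the failure, yet rerouting processor 0's load pushes processor 2 to a utilization of 1, which is the kind of failure-induced overload the abstract warns about.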