Optimal recovery schemes in fault tolerant distributed computing

  • Authors:
  • Kamilla Klonowska;Håkan Lennerstad;Lars Lundberg;Charlie Svahnberg

  • Affiliations:
  • Blekinge Institute of Technology, School of Engineering, 372 25, Ronneby, Sweden (all authors)

  • Venue:
  • Acta Informatica
  • Year:
  • 2005

Abstract

Clusters and distributed systems offer fault tolerance and high performance through load sharing. When all n computers are up and running, the load should be evenly distributed among them. When one or more computers break down, their load must be redistributed to the remaining computers in the system. This redistribution is determined by the recovery scheme, which is governed by a sequence of integers modulo n. Each sequence guarantees minimal load on the most heavily loaded computer, even when the most unfavorable combination of computers goes down. We calculate the best possible recovery schemes for any number of crashed computers by an exhaustive search, in which brute-force testing is avoided through a mathematical reformulation of the problem and a branch-and-bound algorithm. The search nevertheless has high complexity. Optimal sequences, and thus a corresponding optimal bound, are presented for up to twenty-one computers in the distributed system or cluster.
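
To make the idea of a recovery scheme driven by a sequence of integers modulo n more concrete, here is a minimal Python sketch of one possible reading. The abstract does not spell out the model, so everything in it is an assumption: each computer is taken to run one unit of load, and the load of a crashed computer i is moved to the first surviving computer among (i + s) mod n for each s in the sequence, in order. The function names `max_load` and `worst_case_load` are hypothetical. The sketch checks every crash combination naively and does not attempt the reformulation or branch-and-bound search mentioned above.

```python
from itertools import combinations


def max_load(n, offsets, crashed):
    """Return the largest load on any surviving computer.

    Assumed model (not taken from the abstract): every computer starts with
    one unit of load, and the load of a crashed computer i moves to the
    first surviving computer among (i + s) % n for s in offsets, in order.
    """
    crashed = set(crashed)
    # Every surviving computer keeps its own unit of load.
    load = {c: 1 for c in range(n) if c not in crashed}
    for i in crashed:
        for s in offsets:
            target = (i + s) % n
            if target not in crashed:
                load[target] += 1
                break
        else:
            # No offset in the sequence reached a surviving computer.
            raise ValueError("sequence cannot place the load of computer %d" % i)
    return max(load.values())


def worst_case_load(n, offsets, k):
    """Naively check every combination of k crashed computers (k < n)."""
    return max(max_load(n, offsets, c) for c in combinations(range(n), k))
```

Under these assumptions, with n = 4 and two crashed computers, the sequence [1, 2, 3] gives a worst-case load of 3, while [1, 3, 2] gives 2, which illustrates why the choice of sequence matters. The number of combinations grows quickly with n, so this naive check is only practical for small systems.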