Maintaining quality of service with dynamic fault tolerance in fat-trees

  • Authors:
  • Frank Olaf Sem-Jacobsen; Tor Skeie

  • Affiliations:
  • Department of Informatics, University of Oslo, Oslo, Norway and Networks and Distributed Systems, Simula Research Laboratory, Lysaker, Norway; Department of Informatics, University of Oslo, Oslo, Norway and Networks and Distributed Systems, Simula Research Laboratory, Lysaker, Norway

  • Venue:
  • HiPC'08 Proceedings of the 15th international conference on High performance computing
  • Year:
  • 2008

Abstract

A very important ingredient in the computing landscape is Utility Computing Data Centres (UCDCs), large-scale computing systems that offer computational services to concurrently running jobs through virtual servers. As UCDC systems increase in size and the mean time between failure decreases, it is becoming an increasingly important challenge to expediently tolerate failures (dynamically), while distributing the effects of the failure amongst the virtual servers according to their service level agreements. We propose and evaluate a strategy for offering predictable service in fat-trees experiencing faults, by reprioritising packets. The strategy is able to distribute the effect of network faults in order to satisfy a number of quality-of-service demands. Which demands to favour depends on the computer system and the characteristics of the jobs it is running, and in the presence of a moderate number of faults it is to some degree possible to meet the demands.