A new adaptive fault-tolerant routing methodology for direct networks

  • Authors:
  • M. E. Gómez; J. Duato; J. Flich; P. López; A. Robles; N. A. Nordbotten; T. Skeie; O. Lysne

  • Affiliations:
  • Dept of Computer Engineering, Universidad Politécnica de Valencia, Valencia, Spain; Dept of Computer Engineering, Universidad Politécnica de Valencia, Valencia, Spain; Dept of Computer Engineering, Universidad Politécnica de Valencia, Valencia, Spain; Dept of Computer Engineering, Universidad Politécnica de Valencia, Valencia, Spain; Dept of Computer Engineering, Universidad Politécnica de Valencia, Valencia, Spain; Simula Research Laboratory, Lysaker, Norway; Simula Research Laboratory, Lysaker, Norway; Simula Research Laboratory, Lysaker, Norway

  • Venue:
  • HiPC'04 Proceedings of the 11th international conference on High Performance Computing
  • Year:
  • 2004

Abstract

Interconnection networks play a key role in the fault tolerance of massively parallel computers, since faults may isolate a large fraction of the machine containing many healthy nodes. In this paper, we present a methodology to design fully adaptive fault-tolerant routing algorithms for direct interconnection networks that can be applied to different regular topologies. The methodology is mainly based on the selection of an intermediate node (if needed) for each source-destination pair. Packets are adaptively routed to the intermediate node and, from this node, they are adaptively forwarded to their destination. This methodology requires only one additional virtual channel, even for tori. Evaluation results show that the methodology is 7-fault tolerant and that, for up to 14 faults, more than 99% of the fault combinations are tolerated, without significantly degrading performance in the presence of faults.
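
The intermediate-node idea described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not state how intermediate nodes are chosen, so everything below is an assumption made for illustration. The sketch assumes a 2D mesh without wraparound links (not a torus), node faults only, and a deliberately conservative safety rule: a segment counts as safe only if no faulty node lies inside the rectangle spanned by its endpoints, so that every minimal path between them is fault-free. The names `segment_is_safe` and `pick_intermediate` are hypothetical.

```python
from itertools import product


def bounding_box(a, b):
    """Nodes in the rectangle spanned by a and b in a 2D mesh.

    Any minimal (shortest) path between a and b stays inside this set.
    """
    (ax, ay), (bx, by) = a, b
    xs = range(min(ax, bx), max(ax, bx) + 1)
    ys = range(min(ay, by), max(ay, by) + 1)
    return set(product(xs, ys))


def segment_is_safe(a, b, faulty):
    """Assumed rule: safe only if no minimal path can meet a faulty node."""
    return bounding_box(a, b).isdisjoint(faulty)


def manhattan(a, b):
    """Hop count of a minimal path between a and b in a mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def pick_intermediate(src, dst, faulty, width, height):
    """Return None if no intermediate node is needed, else a node I such
    that both src -> I and I -> dst are safe, minimizing total hops.

    Raises LookupError if this simplified rule finds no such node.
    """
    if segment_is_safe(src, dst, faulty):
        return None  # direct adaptive routing is enough
    best = None
    for node in product(range(width), range(height)):
        if node in faulty or node in (src, dst):
            continue
        if segment_is_safe(src, node, faulty) and segment_is_safe(node, dst, faulty):
            cost = manhattan(src, node) + manhattan(node, dst)
            if best is None or cost < best[0]:
                best = (cost, node)  # ties keep the first node found
    if best is None:
        raise LookupError("no single intermediate node under this rule")
    return best[1]


if __name__ == "__main__":
    # A fault in the middle of the minimal region forces a detour via a corner.
    print(pick_intermediate((0, 0), (2, 2), {(1, 1)}, 4, 4))
```

Under this conservative rule, a fault lying directly between two nodes on the same row or column leaves no valid single intermediate node, so the sketch raises `LookupError`. How the methodology itself treats such cases is not described in the abstract.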