A Multipath Fault-Tolerant Routing Method for High-Speed Interconnection Networks

  • Authors:
  • Gonzalo Zarza; Diego Lugones; Daniel Franco; Emilio Luque

  • Affiliations:
  • Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, Spain (all authors)

  • Venue:
  • Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
  • Year:
  • 2009

Abstract

The intensive and continuous use of high-performance computers for executing computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. The interconnection network is a critical part of a high-performance computer system because it links the processing units and carries the communication between them. Network faults have an extremely high impact because a single fault may prevent applications from finishing correctly. This work addresses fault tolerance in high-speed interconnection networks by designing a fault-tolerant routing method. The goal is to tolerate a certain number of link and node failures, taking into account their impact and their probability of occurrence. To accomplish this, we take advantage of communication path redundancy by means of adaptive multipath routing approaches that fulfill the four phases of fault tolerance: error detection, damage confinement, error recovery, and fault treatment and continuous service. Experiments show that our method allows applications to complete their execution in the presence of several faults, with an average performance of 97% relative to the fault-free scenarios.
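
The idea of exploiting path redundancy can be illustrated with a minimal sketch. This is not the routing method of the paper, which the abstract does not detail; it only shows, under simplifying assumptions, how a route can be recomputed around links that have been detected as faulty. The function name `find_path`, the 2x2 mesh topology, and the representation of a link as an unordered pair of node ids are all hypothetical choices made for the illustration.

```python
from collections import deque


def find_path(adjacency, src, dst, failed_links=frozenset()):
    """Return a shortest hop-count path from src to dst, or None.

    Breadth-first search that skips every link listed in failed_links.
    A link is an unordered pair, stored as frozenset((a, b)).
    """
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk back through the parent pointers to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            link = frozenset((node, neighbor))
            if neighbor not in parents and link not in failed_links:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # destination unreachable with the current set of faults


# Hypothetical 2x2 mesh with links 0-1, 0-2, 1-3 and 2-3.
mesh = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

primary = find_path(mesh, 0, 3)              # fault-free route
failed = {frozenset((0, 1))}                 # link 0-1 reported as faulty
alternative = find_path(mesh, 0, 3, failed)  # route that avoids the fault
```

In this toy topology the fault-free route goes through node 1, and once link 0-1 is marked as failed the search returns the redundant route through node 2. A real adaptive multipath scheme would also have to cover fault detection, confinement, and deadlock avoidance, which this sketch leaves out.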