A scalable methodology for computing fault-free paths in InfiniBand torus networks

Authors:
J. M. Montañana;J. Flich;A. Robles;J. Duato
Affiliations:
Dept. of Computer Engineering, DISCA, UPV, Valencia, Spain;Dept. of Computer Engineering, DISCA, UPV, Valencia, Spain;Dept. of Computer Engineering, DISCA, UPV, Valencia, Spain;Dept. of Computer Engineering, DISCA, UPV, Valencia, Spain
Venue:
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Year:
2005

Abstract

Clusters of PCs are currently considered a cost-effective alternative to large parallel computers. In these systems the interconnection network plays a key role. As the number of elements in these systems increases, the probability of faults increases dramatically. Moreover, in some cases it is critical to keep the system running even in the presence of faults. Therefore, an effective fault-tolerant strategy is needed. InfiniBand (IBA) is a new standard interconnect suitable for clusters. Unfortunately, most of the fault-tolerant routing strategies proposed for massively parallel computers cannot be applied to IBA because routing and virtual channel transitions are deterministic, which prevents packets from avoiding the faults. A possible approach to providing fault tolerance in IBA consists of using several disjoint paths between every source-destination pair of nodes and selecting the appropriate path at the source host. To this end, however, a routing algorithm is required that provides enough disjoint paths while still guaranteeing deadlock freedom. In this paper we address this issue by proposing a scalable fault-tolerant methodology for IBA torus networks. Results show that the proposed methodology scales and supports up to 2n - 1 faults for n-dimensional tori when using 2 VLs (virtual lanes) and 4 SLs (service levels), regardless of the network size. Additionally, the methodology is able to support up to 3 faults for 2D tori with 2 VLs and only 3 SLs.
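The idea of selecting a fault-free path at the source host can be illustrated with a minimal Python sketch. It is not the routing algorithm of the paper: the function names (`torus_neighbors`, `link`, `select_path`), the hand-written candidate paths, and the treatment of faults as failed links are all assumptions made for illustration, and deadlock freedom and the VL/SL assignment are not modeled. The sketch only shows why 2n - 1 is a natural bound: a node in an n-dimensional torus with more than two nodes per dimension has 2n links, so at most 2n link-disjoint paths can leave it, and 2n - 1 link faults can block at most 2n - 1 of them.

```python
def torus_neighbors(node, dims):
    """Return the set of neighbors of `node` in a torus with per-dimension sizes `dims`."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coords = list(node)
            coords[axis] = (coords[axis] + step) % size  # wrap-around link
            neighbors.add(tuple(coords))
    neighbors.discard(node)  # only relevant for dimensions of size 1
    return neighbors


def link(a, b):
    """Represent a bidirectional link as an unordered pair of nodes."""
    return frozenset((a, b))


def select_path(candidate_paths, faulty_links):
    """Return the first candidate path that uses no faulty link, or None.

    Assumption: the candidate paths are precomputed and link-disjoint.
    """
    for path in candidate_paths:
        used = {link(a, b) for a, b in zip(path, path[1:])}
        if used.isdisjoint(faulty_links):
            return path
    return None


# Hypothetical example: 4x4 torus (n = 2), source (0, 0), destination (0, 2).
dims = (4, 4)
candidates = [
    [(0, 0), (0, 1), (0, 2)],
    [(0, 0), (0, 3), (0, 2)],
    [(0, 0), (1, 0), (1, 1), (1, 2), (0, 2)],
    [(0, 0), (3, 0), (3, 1), (3, 2), (0, 2)],
]

# 2n - 1 = 3 link faults, one on each of the first three paths.
faults = {
    link((0, 0), (0, 1)),
    link((0, 3), (0, 2)),
    link((1, 1), (1, 2)),
}

print(len(torus_neighbors((0, 0), dims)))
print(select_path(candidates, faults))
```

In this example the source has four links, three candidate paths are blocked by one fault each, and the fourth path is selected. A fourth fault placed on that remaining path would leave no candidate, which matches the 2n - 1 limit stated in the abstract for this simplified model.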