Reachability-Based Fault-Tolerant Routing

Authors:
J. M. Montanana;J. Flich;A. Robles;J. Duato
Affiliations:
Universidad Politécnica de Valencia, Spain;Universidad Politécnica de Valencia, Spain;Universidad Politécnica de Valencia, Spain;Universidad Politécnica de Valencia, Spain
Venue:
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
Year:
2006

Abstract

Clusters of PCs are currently used as a cost-effective alternative to large parallel computers. In most of them, it is critical to keep the system running even in the presence of faults. As the number of nodes in these systems increases, the interconnection network grows accordingly. With more components, the probability of a fault rises sharply, so fault tolerance in the system in general, and in the interconnection network in particular, plays a key role. One interesting approach to providing fault tolerance is to migrate, on the fly, the paths affected by a failure to new fault-free paths. In this paper, we propose a simple and effective fault-tolerant routing methodology, referred to as Reachability-Based Fault-Tolerant Routing (RFTR), that can be applied to any topology. RFTR builds new alternative paths by joining subpaths extracted from the set of already computed paths, which makes it time-efficient. To avoid deadlocks, RFTR performs, if required, a virtual channel transition at the point where the subpaths are joined. As an example of its applicability, we apply RFTR to InfiniBand. Evaluation results on tori show that RFTR has a low computation cost and does not degrade performance significantly.
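The idea of joining subpaths from already computed paths can be illustrated with a small sketch. This is not the authors' algorithm: the function names `uses_link`, `join_subpaths`, and `assign_virtual_channels`, the representation of a path as a list of node IDs, and the rule of switching from virtual channel 0 to 1 at every interior join node are all assumptions made for illustration. The abstract says only that a transition is performed "if required" and does not give the condition.

```python
from typing import List, Optional, Tuple

Path = List[int]          # a path is a list of node IDs (assumed representation)
Link = Tuple[int, int]    # an undirected link between two nodes


def uses_link(path: Path, failed: Link) -> bool:
    """Return True if the path traverses the failed link in either direction."""
    a, b = failed
    return any({u, v} == {a, b} for u, v in zip(path, path[1:]))


def join_subpaths(paths: List[Path], src: int, dst: int,
                  failed: Link) -> Optional[Tuple[Path, int]]:
    """Build a src->dst path from two fault-free precomputed subpaths.

    Returns (new_path, join_node) for the shortest candidate found,
    or None if no pair of fault-free paths can be joined.
    """
    healthy = [p for p in paths if not uses_link(p, failed)]
    best: Optional[Tuple[Path, int]] = None
    for first in healthy:
        if src not in first:
            continue
        i = first.index(src)
        for second in healthy:
            if dst not in second:
                continue
            j = second.index(dst)
            # Try every node reachable from src on `first` as a join point.
            for k in range(i, len(first)):
                node = first[k]
                if node not in second[:j + 1]:
                    continue
                m = second.index(node)
                candidate = first[i:k + 1] + second[m + 1:j + 1]
                is_simple = len(set(candidate)) == len(candidate)
                if is_simple and (best is None or len(candidate) < len(best[0])):
                    best = (candidate, node)
    return best


def assign_virtual_channels(path: Path, join: int) -> List[Tuple[int, int, int]]:
    """Label each hop (u, v) with a virtual channel.

    Assumed rule: hops before an interior join node use VC 0, hops from the
    join node onward use VC 1. If the join is an endpoint, no transition occurs.
    """
    interior = join not in (path[0], path[-1])
    vc = 0
    hops = []
    for u, v in zip(path, path[1:]):
        if interior and u == join:
            vc = 1
        hops.append((u, v, vc))
    return hops


if __name__ == "__main__":
    precomputed = [[0, 1, 2], [0, 3], [3, 2], [1, 2]]
    found = join_subpaths(precomputed, src=0, dst=2, failed=(0, 1))
    if found is not None:
        new_path, join_node = found
        print(new_path, assign_virtual_channels(new_path, join_node))
```

Because the sketch only reuses paths that are already computed, it never searches the full topology. That is the property the abstract credits for RFTR's low computation cost. A real implementation would also need a deadlock-freedom check to decide when the transition is actually required.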