A New Approach to Fault-Tolerant Wormhole Routing for Mesh-Connected Parallel Computers

Authors:
Ching-Tien Ho; Larry J. Stockmeyer
Venue:
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Year:
2002

Abstract

A new method for fault-tolerant routing in meshes of arbitrary dimension is introduced. The method was motivated by certain routing requirements of an initial design of the Blue Gene supercomputer project currently underway at IBM Research. Among the requirements were to provide deterministic deadlock-free wormhole routing in a 3-dimensional mesh, in the presence of many faults (up to a few percent of the many thousands of nodes in the machine), while using two virtual channels. It was also desired to minimize the number of "turns" in each route, i.e., the number of times that the route changes direction. There has been much work on routing methods for meshes that route messages around faults or regions of faults. The new method is to declare certain good nodes to be "lambs"; a lamb is used for routing but not processing, so a lamb is neither the source nor the destination of a message. The lambs are chosen so that every "survivor node", a node that is neither faulty nor a lamb, can reach every survivor node by at most two rounds of dimension-ordered (such as e-cube) routing. An algorithm for finding a set of lambs is presented. The results of simulations on 2D and 3D meshes of various sizes with various numbers of random node faults are given. For example, on a 32×32×32 3D mesh with 3% random faults, and using two rounds of e-cube routing for each message, the average number of lambs is less than 68, which is less than 7% of the 983 faults. The computational complexity of finding the minimum number of lambs for a given fault set is also explored, and this problem is shown to be NP-hard for 3-dimensional meshes with two rounds of e-cube routing.
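
The lamb condition described in the abstract can be illustrated with a small sketch. The code below is not the paper's lamb-finding algorithm; it only checks, by brute force, whether a given lamb set satisfies the stated property on a tiny mesh. The function names (`ecube_path`, `one_round_ok`, `two_round_ok`, `lambs_valid`) are hypothetical, and the sketch assumes that the intermediate node between the two rounds may be any non-faulty node (survivor or lamb) and that dimensions are corrected in index order.

```python
from itertools import product


def ecube_path(src, dst):
    """Nodes visited by dimension-ordered routing from src to dst, inclusive.

    Assumption: dimensions are corrected in index order (0, then 1, ...).
    """
    path = [tuple(src)]
    cur = list(src)
    for d in range(len(src)):
        step = 1 if dst[d] > cur[d] else -1
        while cur[d] != dst[d]:
            cur[d] += step
            path.append(tuple(cur))
    return path


def one_round_ok(src, dst, faults):
    """True if the single e-cube route from src to dst avoids every fault."""
    return all(node not in faults for node in ecube_path(src, dst))


def two_round_ok(src, dst, good, faults):
    """True if some non-faulty intermediate w gives fault-free routes
    src -> w and w -> dst. Choosing w == src covers the one-round case."""
    return any(
        one_round_ok(src, w, faults) and one_round_ok(w, dst, faults)
        for w in good
    )


def lambs_valid(shape, faults, lambs):
    """Brute-force check that every survivor reaches every survivor
    in at most two rounds of e-cube routing."""
    nodes = set(product(*(range(k) for k in shape)))
    good = nodes - faults          # lambs may still forward messages
    survivors = good - lambs       # only survivors send or receive
    return all(
        two_round_ok(s, t, good, faults)
        for s in survivors
        for t in survivors
    )


if __name__ == "__main__":
    # 3x3 mesh in which faults at (0,1) and (1,0) cut off the corner (0,0).
    faults = {(0, 1), (1, 0)}
    print(lambs_valid((3, 3), faults, set()))      # corner kept as a survivor
    print(lambs_valid((3, 3), faults, {(0, 0)}))   # corner declared a lamb
```

In this toy case the isolated corner cannot reach any other node, so the property fails until that node is declared a lamb. The check is quadratic in the number of survivors times the number of good nodes, so it is suitable only for small illustrative meshes, not for the 32×32×32 configuration mentioned above.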