This paper presents a software-based approach to fault-tolerant routing in networks using wormhole or virtual cut-through switching. When a message encounters a faulty output link, it is removed from the network by the local router and delivered to the messaging layer of the local node's operating system. The message passing software can reroute this message, possibly along nonminimal paths. Alternatively, the message may be addressed to an intermediate node, which will forward the message to the destination. A message may encounter multiple faults and pass through multiple intermediate nodes. The proposed techniques are applicable to both obliviously and adaptively routed networks. The techniques are specifically targeted toward commercial multiprocessors where the mean time to repair (MTTR) is much smaller than the mean time between router failures (MTBF), i.e., it is sufficient to tolerate a maximum of three failures.

This paper presents requirements for buffer management, deadlock freedom, and livelock freedom. Simulation results are presented to evaluate the degradation in latency and throughput as a function of the number and distribution of faults. There are several advantages of such an approach. Router designs are minimally impacted, and thus remain compact and fast. Only messages that encounter faulty components are affected, while the machine is ensured of continued operation until the faulty components can be replaced. The technique leverages existing network technology, and the concepts are portable across evolving switch and router designs. Therefore, we feel that the technique is a good candidate for incorporation into the next generation of multiprocessor networks.
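The eject-and-reroute idea described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's algorithm: it assumes a 2D mesh with dimension-order (XY) routing in hardware, and every name in it (`hardware_route`, `pick_intermediate`, `deliver`, `max_reroutes`) is hypothetical. The rule for choosing an intermediate node is a simple heuristic invented for this example, and the reroute cap is a crude stand-in for the livelock-freedom requirements the paper discusses.

```python
from typing import FrozenSet, List, Optional, Set, Tuple

Node = Tuple[int, int]
Link = FrozenSet[Node]

def link(a: Node, b: Node) -> Link:
    """Undirected link between two adjacent mesh nodes."""
    return frozenset((a, b))

def xy_next_hop(cur: Node, dst: Node) -> Node:
    """Dimension-order routing: correct X first, then Y."""
    x, y = cur
    if x != dst[0]:
        return (x + (1 if dst[0] > x else -1), y)
    if y != dst[1]:
        return (x, y + (1 if dst[1] > y else -1))
    return cur

def hardware_route(src: Node, dst: Node, faulty: Set[Link]) -> Tuple[List[Node], bool]:
    """Follow XY routing until arrival or a faulty output link.
    Returns the nodes visited and whether the message was ejected."""
    path, cur = [src], src
    while cur != dst:
        nxt = xy_next_hop(cur, dst)
        if link(cur, nxt) in faulty:
            return path, True  # router hands the message to local software
        path.append(nxt)
        cur = nxt
    return path, False

def pick_intermediate(cur: Node, dst: Node, faulty: Set[Link], size: int) -> Optional[Node]:
    """Messaging-layer heuristic (an assumption of this sketch): pick a nearby
    node from which XY routing toward dst no longer uses the faulty link."""
    blocked = xy_next_hop(cur, dst)
    dx, dy = blocked[0] - cur[0], blocked[1] - cur[1]
    if dx != 0:   # faulty X link: sidestep into a neighbouring row
        candidates = [(cur[0], cur[1] + 1), (cur[0], cur[1] - 1)]
    else:         # faulty Y link: go around via a neighbouring column
        candidates = [(cur[0] + 1, cur[1] + dy), (cur[0] - 1, cur[1] + dy)]
    for n in candidates:
        in_mesh = 0 <= n[0] < size and 0 <= n[1] < size
        # Only the local node's own link status is consulted here.
        if in_mesh and link(cur, xy_next_hop(cur, n)) not in faulty:
            return n
    return None

def deliver(src: Node, dst: Node, faulty: Set[Link], size: int,
            max_reroutes: int = 3) -> Optional[List[Node]]:
    """Route src -> dst, re-injecting via intermediate nodes after each fault.
    Returns the full node path, or None if the message is given up."""
    path, cur = [src], src
    targets = [dst]          # final destination at the bottom, detours on top
    reroutes = 0
    while targets:
        seg, ejected = hardware_route(cur, targets[-1], faulty)
        path.extend(seg[1:])
        cur = seg[-1]
        if not ejected:
            targets.pop()    # reached this target; continue toward the next
            continue
        reroutes += 1
        if reroutes > max_reroutes:
            return None      # cap bounds the number of software reroutes
        mid = pick_intermediate(cur, targets[-1], faulty, size)
        if mid is None:
            return None
        targets.append(mid)
    return path
```

For example, on a 4x4 mesh with the link between (1,0) and (2,0) marked faulty, `deliver((0, 0), (3, 0), {link((1, 0), (2, 0))}, 4)` is ejected at (1,0), detours through the intermediate node (1,1), and arrives over a nonminimal path. Messages that never meet a faulty link follow the ordinary XY path unchanged, which mirrors the abstract's point that only messages encountering faults are affected.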