Massively parallel computing systems are being built with thousands of nodes, and the interconnection network plays a key role in their performance. However, the large number of components significantly increases the probability of failure, and failures in the interconnection network may isolate a large fraction of the machine. It is therefore critical to provide an efficient fault-tolerant mechanism that keeps the system running in the presence of faults. This paper presents a new fault-tolerant routing methodology that does not degrade performance in the absence of faults and tolerates a reasonably large number of faults without disabling any healthy node. To avoid faults, packets for some source-destination pairs are first sent to an intermediate node and then from that node to the destination. Fully adaptive routing is used along both subpaths. The methodology assumes a static fault model and the use of a checkpoint/restart mechanism. In some scenarios, however, faults cannot be avoided solely by using an intermediate node, so we also provide extensions to the methodology. Specifically, we propose disabling adaptive routing and/or using misrouting on a per-packet basis, as well as using more than one intermediate node for some paths. The proposed fault-tolerant routing methodology is extensively evaluated in terms of fault tolerance, complexity, and performance.
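The intermediate-node idea can be illustrated with a minimal sketch. It is not the selection procedure from the paper; it rests on several assumptions made only for illustration: a 2D mesh, node faults only, a statically known fault set, and a conservative safety test that treats a subpath as usable only if no faulty node lies on any minimal path between its endpoints (in a mesh, the bounding box of the two nodes). The names `minimal_path_nodes`, `is_safe`, and `choose_route` are hypothetical.

```python
from itertools import product

def minimal_path_nodes(a, b):
    """Nodes lying on some minimal path between a and b in a 2D mesh
    (the bounding box spanned by the two nodes)."""
    (ax, ay), (bx, by) = a, b
    xs = range(min(ax, bx), max(ax, bx) + 1)
    ys = range(min(ay, by), max(ay, by) + 1)
    return set(product(xs, ys))

def is_safe(a, b, faults):
    """Assumption: fully adaptive minimal routing may take any minimal
    path, so a subpath is safe only if its bounding box has no fault."""
    return not (minimal_path_nodes(a, b) & faults)

def hops(a, b):
    """Minimal hop count between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def choose_route(src, dst, faults, width, height):
    """Return [src, dst] if the direct route is safe, [src, mid, dst] if
    one intermediate node suffices, or None if neither works."""
    if is_safe(src, dst, faults):
        return [src, dst]
    best = None  # (total hops, intermediate node)
    for node in product(range(width), range(height)):
        if node in faults or node in (src, dst):
            continue
        if is_safe(src, node, faults) and is_safe(node, dst, faults):
            cost = hops(src, node) + hops(node, dst)
            if best is None or cost < best[0]:
                best = (cost, node)
    return None if best is None else [src, best[1], dst]

if __name__ == "__main__":
    # One fault in the middle of the minimal-path region: detour via a corner.
    print(choose_route((0, 0), (2, 2), {(1, 1)}, 4, 4))
    # Fault on the only minimal path of a same-row pair: no single
    # intermediate node helps under this conservative test.
    print(choose_route((0, 0), (3, 0), {(1, 0)}, 4, 4))
```

In this sketch, the second call returns `None`. That corresponds to the kind of scenario in which an intermediate node alone is not enough and extensions such as misrouting, disabling adaptive routing, or a second intermediate node would be needed.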