On Designing and Reconfiguring k-Fault-Tolerant Tree Architectures
IEEE Transactions on Computers
Designing fault-tolerant systems using automorphisms
Journal of Parallel and Distributed Computing
Chordal rings as fault-tolerant loops
Discrete Applied Mathematics - Special double volume: interconnection networks
Some Practical Issues in the Design of Fault-Tolerant Multiprocessors
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Fault-Tolerant Meshes and Hypercubes with Minimal Numbers of Spares
IEEE Transactions on Computers
Connective Fault Tolerance in Multiple-Bus Systems
IEEE Transactions on Parallel and Distributed Systems
A Graph Model for Fault-Tolerant Computing Systems
IEEE Transactions on Computers
Computing in the RAIN: A Reliable Array of Independent Nodes
IEEE Transactions on Parallel and Distributed Systems
A Group Membership Algorithm with a Practical Specification
IEEE Transactions on Parallel and Distributed Systems
Survivable Computer Networks in the Presence of Partitioning
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Computing in the RAIN: A Reliable Array of Independent Nodes
IPDPS '00 Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing
A Proxy-Network Based Overlay Topology Resistant to DoS Attacks and Partitioning
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 16 - Volume 17
Hi-index | 0.00 |
The RAIN (Reliable Array of Independent Nodes) project at Caltech is focusing on creating highly reliable distributed systems by leveraging commercially available personal computers, workstations and interconnect technologies. In particular, the issue of reliable communication is addressed by introducing redundancy in the form of multiple network interfaces per compute node. When using compute nodes with multiple network connections the question of how to best connect these nodes to a given network of switches arises. We examine networks of switches (e.g. based on Myrinet technology) and focus on degree-two compute nodes (two network adaptor cards per node). Our primary goal is to create networks that are as resistant as possible to partitioning.Our main contributions are: (i) a construction for degree-2 compute nodes connected by a ring network of switches of degree 4 that can tolerate any 3 switch failures without partitioning the nodes into disjoint sets, (ii) a proof that this construction is optimal in the sense that no construction can tolerate more switch failures while avoiding partitioning and (iii) generalizations of this construction to arbitrary switch and node degrees and to other switch networks, in particular, to a fully-connected network of switches.