The intensive and continuous use of high-performance computers for executing computationally intensive applications, coupled with the large number of elements that make them up, dramatically increases the likelihood of failures during their operation. The interconnection network is a critical part of a high-performance computer system because it links the processing units and carries the communication between them. Network faults have an extremely high impact because a single fault may prevent applications from completing correctly. This work addresses fault tolerance in high-speed interconnection networks by designing a fault-tolerant routing method. The goal is to tolerate a certain number of link and node failures, taking into account their impact and probability of occurrence. To accomplish this, we exploit communication path redundancy through adaptive multipath routing approaches that fulfill the four phases of fault tolerance: error detection; damage confinement; error recovery; and fault treatment and continuous service. Experiments show that our method allows applications to complete their execution in the presence of several faults, with an average performance of 97% relative to the fault-free scenarios.