Self-healing network for scalable fault-tolerant runtime environments

Authors:
Thara Angskun;Graham Fagg;George Bosilca;Jelena Pješivac-Grbović;Jack Dongarra
Affiliations:
Department of Computer Science, The University of Tennessee, 1122 Volunteer Blvd., Knoxville, TN 37996, USA (all authors)
Venue:
Future Generation Computer Systems
Year:
2010

Abstract

The number of processors embedded in high-performance computing platforms is growing daily to satisfy users' desire to solve larger and more complex problems. Scalable and fault-tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware, which require a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures, and it recovers automatically after a failure occurs. SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). The experimental results show that the latest multicast and broadcast routing algorithms used in SHN are both faster and more reliable than the original SFTP routing algorithms.