IEEE Transactions on Software Engineering - Special issue on formal methods in software practice
Harness: a next generation distributed virtual machine
Future Generation Computer Systems - Special issue on metacomputing
Chord: A scalable peer-to-peer lookup service for internet applications
Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications
Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems
Middleware '01 Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg
Scalable Fault-Tolerant Aggregation in Large Process Groups
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
HARNESS fault tolerant MPI design, usage and performance issues
Future Generation Computer Systems - Grid computing: Towards a new computing infrastructure
A Gossip-Style Failure Detection Service
A Gossip-Style Failure Detection Service
Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and
Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and
MPI: A Message-Passing Interface Standard
MPI: A Message-Passing Interface Standard
Discrete-Event Simulation: A First Course
Discrete-Event Simulation: A First Course
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Fault tolerance logical network properties of irregular graphs
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Hi-index | 0.01 |
The number of processors embedded on high performance computing platforms is growing daily to satisfy the user desire for solving larger and more complex problems. Scalable and fault-tolerant runtime environments are needed to support and adapt to the underlying libraries and hardware which require a high degree of scalability in dynamic large-scale environments. This paper presents a self-healing network (SHN) for supporting scalable and fault-tolerant runtime environments. The SHN is designed to support transmission of messages across multiple nodes while also protecting against recursive node and process failures. It will automatically recover itself after a failure occurs. SHN is implemented on top of a scalable fault-tolerant protocol (SFTP). The experimental results show that both the latest multicast and broadcast routing algorithms used in SHN are faster and more reliable than the original SFTP routing algorithms.