Scalable fault tolerant protocol for parallel runtime environments

Authors:
Thara Angskun; Graham E. Fagg; George Bosilca; Jelena Pješivac-Grbović; Jack J. Dongarra
Affiliations:
Dept. of Computer Science, The University of Tennessee, Knoxville, TN (all authors)
Venue:
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Year:
2006

Abstract

The number of processors embedded in high performance computing platforms is growing daily to satisfy users' desire to solve larger and more complex problems. Parallel runtime environments have to support and adapt to the underlying libraries and hardware, which requires a high degree of scalability in dynamic environments. This paper presents the design of a scalable and fault tolerant protocol for supporting parallel runtime environment communications. The protocol is designed to support transmission of messages across multiple nodes within a self-healing topology to protect against recursive node and process failures. A formal protocol verification has validated the protocol for both the normal and failure cases. We have implemented multiple routing algorithms for the protocol and concluded that the variant rule-based routing algorithm yields the best overall results for damaged and incomplete topologies.