In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.