In this paper we discuss the design and use of a fault-tolerant MPI (FT-MPI) that handles process failures in ways that go beyond the original static MPI process model. FT-MPI lets an application explicitly control the failure semantics and the associated failure modes through modified functionality within the standard MPI 1.2 API. We give an overview of FT-MPI's semantics, architectural design, example usage, and sample applications. We also briefly discuss the consequences of designing a fault-tolerant MPI, both in terms of how such an implementation handles failures internally at multiple levels and in terms of how existing applications can use the new features while remaining within the MPI standard.