A channel memory based fault tolerance for MPI applications

Authors:
A. Selikhov; C. Germain
Affiliations:
Supercomputer Software Department, ICMMG SB RAS, pr. Lavrentieva, Novosibirsk, Russia; Université Paris-Sud, LAL (CNRS), Bâtiment, Orsay Cedex, France
Venue:
Future Generation Computer Systems - Special issue: Parallel computing technologies
Year:
2005

Abstract

Fault-tolerant message-passing environments protect parallel applications against node failures. Very large-scale computing systems, ranging from large clusters to worldwide Global Computing systems, require a high level of fault tolerance to run parallel applications efficiently. The Channel Memory approach provides the infrastructure for scalable tolerance of simultaneous faults. Combined with a specially designed checkpointing system and recovery protocol, this approach has resulted in the MPICH-V architecture. In this paper, we describe CMDE, a stand-alone distributed program system that is based on the MPICH-V architecture and implements an approach to tolerating faults of Channel Memories.
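
The abstract names the Channel Memory approach without describing its mechanism. As a rough illustration only, the sketch below assumes that a Channel Memory acts as a stable intermediary that logs every message addressed to a process, so that a restarted process can replay its receptions in the original order. The class and method names (`ChannelMemory`, `Process`, `put`, `get`) are hypothetical and are not taken from MPICH-V or CMDE; checkpointing and the tolerance of Channel Memory faults that CMDE targets are not modeled.

```python
from collections import defaultdict


class ChannelMemory:
    """Hypothetical in-memory stand-in for a Channel Memory node."""

    def __init__(self):
        # Per-receiver log of (sender, payload) pairs, kept in arrival order.
        self._log = defaultdict(list)

    def put(self, sender, receiver, payload):
        # Senders never talk to the receiver directly; they deposit here.
        self._log[receiver].append((sender, payload))

    def get(self, receiver, index):
        # Return the index-th message logged for receiver, or None if absent.
        log = self._log[receiver]
        return log[index] if index < len(log) else None


class Process:
    """Hypothetical compute process that communicates only via the log."""

    def __init__(self, rank, cm):
        self.rank = rank
        self.cm = cm
        self.delivered = 0  # volatile state: lost when the process crashes

    def send(self, dest, payload):
        self.cm.put(self.rank, dest, payload)

    def recv(self):
        msg = self.cm.get(self.rank, self.delivered)
        if msg is not None:
            self.delivered += 1
        return msg


def demo():
    cm = ChannelMemory()
    p0, p1 = Process(0, cm), Process(1, cm)
    p0.send(1, "a")
    p0.send(1, "b")
    first_run = [p1.recv(), p1.recv()]

    # Simulate a crash of rank 1: the restarted process starts from scratch
    # and replays its receptions from the surviving log.
    restarted = Process(1, cm)
    replayed = [restarted.recv(), restarted.recv()]
    return first_run, replayed
```

In this toy model the sender does not need to take part in recovery, because the log outlives the failed receiver. That is the property the sketch is meant to show; it says nothing about how the real system stores, distributes, or protects the log.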