Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
High-throughput resource management
The grid
CLIP: a checkpointing tool for message-passing parallel programs
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Managing Checkpoints for Parallel Programs
IPPS '96 Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing
MPICH-CM: A Communication Library Design for a P2P MPI Implementation
Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
XtremWeb: A Generic Global Computing System
CCGRID '01 Proceedings of the 1st International Symposium on Cluster Computing and the Grid
CCGRID '01 Proceedings of the 1st International Symposium on Cluster Computing and the Grid
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Migol: A fault-tolerant service framework for MPI applications in the grid
Future Generation Computer Systems
Hi-index | 0.00 |
Fault tolerant message passing environments protect parallel applications against node failures. Very large scale computing systems, ranging from large clusters to worldwide Global Computing systems, require a high level of fault tolerance in order to efficiently run parallel applications. The Channel Memory approach provides the infrastructure for scalable tolerance to simultaneous faults. Along with a specially designed checkpointing system and recovery protocol, this approach has resulted in the MPICH-V architecture. In this paper, we describe CMDE - a stand-alone distributed program system based on MPICH-V architecture and implementing an approach to tolerate faults of Channel Memories.