Long-running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passing Interface (MPI)-level transparent checkpoint/restart fault tolerance is an appealing option for HPC application developers who do not wish to restructure their code. Historically, MPI implementations that provided this option have struggled to support the full range of interconnects, especially shared memory. This paper presents a new approach for implementing checkpoint/restart coordination algorithms that makes the MPI checkpoint/restart implementation interconnect-agnostic. This approach allows an application to be checkpointed on one set of interconnects (e.g., InfiniBand and shared memory) and restarted with a different set (e.g., Myrinet and shared memory, or Ethernet). By separating the network interconnect details from the checkpoint/restart coordination algorithm, we allow the HPC application to respond to interconnect unavailability due to switch failure, rebalance load on an existing machine, or migrate to a different machine with a different set of interconnects. We present results characterizing the performance impact of this approach on HPC applications.