A scalable asynchronous replication-based strategy for fault tolerant MPI applications

Authors:
John Paul Walters;Vipin Chaudhary
Affiliations:
Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY;Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY
Venue:
HiPC'07 Proceedings of the 14th international conference on High performance computing
Year:
2007

Abstract

As computational clusters grow in size, their mean time to failure decreases. Checkpointing is typically used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for checkpoints, which severely limits their scalability. We propose a scalable replication-based MPI checkpointing facility based on LAM/MPI. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, SAN-based solutions, and a commercial parallel file system, and show that they are not scalable, particularly beyond 64 CPUs. We demonstrate the low overhead of our replication scheme with the NAS Parallel Benchmarks and the High Performance LINPACK benchmark in tests of up to 256 nodes, and show that checkpointing and replication can be achieved with much lower overhead than current techniques provide.