Replication-Based Fault Tolerance for MPI Applications

Authors: John Paul Walters; Vipin Chaudhary
Affiliations: University at Buffalo, Buffalo; University at Buffalo, Buffalo
Venue: IEEE Transactions on Parallel and Distributed Systems
Year: 2009

Abstract

As computational clusters increase in size, their mean time to failure decreases drastically. Checkpointing is typically used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for checkpoints. This creates a bottleneck that severely limits the scalability of checkpointing, and dedicated checkpointing networks and storage systems prove too expensive. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, the approach is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system and show that they do not scale, particularly beyond 64 CPUs. Using the NAS Parallel Benchmarks and the High-Performance LINPACK benchmark, with tests on up to 256 nodes, we demonstrate that checkpointing and replication can be achieved with much lower overhead than current techniques provide. Finally, we show that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.
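
The idea of replacing central checkpoint storage with replication onto peer nodes can be illustrated with a small sketch. The abstract does not specify how replica nodes are chosen, so the random placement policy below is an assumption, and the names `choose_replica_peers` and `surviving_ranks` are hypothetical helpers invented for this illustration. The code uses plain Python rather than MPI calls and only models which ranks would hold copies of which checkpoints.

```python
import random


def choose_replica_peers(num_ranks, replicas, seed=0):
    """Map each rank to `replicas` distinct peer ranks, never itself.

    Assumption: peers are picked uniformly at random; the paper's
    actual placement policy is not stated in the abstract.
    """
    if not 0 < replicas < num_ranks:
        raise ValueError("need 0 < replicas < num_ranks")
    rng = random.Random(seed)
    placement = {}
    for rank in range(num_ranks):
        candidates = [r for r in range(num_ranks) if r != rank]
        placement[rank] = sorted(rng.sample(candidates, replicas))
    return placement


def surviving_ranks(placement, failed):
    """Return ranks whose checkpoint still exists on at least one live node."""
    failed = set(failed)
    recoverable = set()
    for rank, peers in placement.items():
        holders = set(peers) | {rank}  # local copy plus its replicas
        if holders - failed:
            recoverable.add(rank)
    return recoverable
```

With two replicas per rank, every checkpoint lives on three nodes, so any single node failure leaves all checkpoints recoverable without touching shared storage. For example, `surviving_ranks(choose_replica_peers(8, 2), failed=[0])` contains every rank from 0 to 7.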