Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand

Authors:
Qi Gao;Weikuan Yu;Wei Huang;Dhabaleswar K. Panda
Affiliations:
The Ohio State University, USA;The Ohio State University, USA;The Ohio State University, USA;The Ohio State University, USA
Venue:
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Year:
2006

Citing 0
Cited 16

A serialization based approach for strong mobility of shared object

Proceedings of the 5th international symposium on Principles and practice of programming in Java
CprFS: a user-level file system to support consistent file states for checkpoint and restart

Proceedings of the 22nd annual international conference on Supercomputing
Adapting a message-driven parallel application to GPU-accelerated clusters

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Scalable transparent checkpoint-restart of global address space applications on virtual machines over infiniband

Proceedings of the 6th ACM conference on Computing frontiers
Interconnect agnostic checkpoint/restart in open MPI

Proceedings of the 18th ACM international symposium on High performance distributed computing
A serialisation based approach for processes strong mobility

DAIS'07 Proceedings of the 7th IFIP WG 6.1 international conference on Distributed applications and interoperable systems
A scalable asynchronous replication-based strategy for fault tolerant MPI applications

HiPC'07 Proceedings of the 14th international conference on High performance computing
Selective Recovery from Failures in a Task Parallel Programming Model

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Checkpoint/restart-enabled parallel debugging

EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
An effective speedup metric for measuring productivity in large-scale parallel computer systems

The Journal of Supercomputing
Application-Level checkpointing techniques for parallel programs

ICDCIT'06 Proceedings of the Third international conference on Distributed Computing and Internet Technology
Can checkpoint/restart mechanisms benefit from hierarchical data staging?

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Impact of over-decomposition on coordinated checkpoint/rollback protocol

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Practical model-checking method for verifying correctness of MPI programs

PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Automatic resource-centric process migration for MPI

EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
X10-FT: Transparent fault tolerance for APGAS language and runtime

Parallel Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

Ultra-scale computer clusters with high speed interconnects, such as InfiniBand, are being widely deployed for their excellent performance and cost effectiveness. However, the failure rate on these clusters also increases along with their augmented number of components. Thus, it becomes critical for such systems to be equipped with fault tolerance support. In this paper, we present our design and implementation of checkpoint/restart framework for MPI programs running over InfiniBand clusters. Our design enables low-overhead, application-transparent checkpointing. It uses coordinated protocol to save the current state of the whole MPI job to reliable storage, which allows users to perform rollback recovery if the system runs into faulty states later. Our solution has been incorporated into MVAPICH2, an open-source high performance MPI-2 implementation over InfiniBand. Performance evaluation of this implementation has been carried out using NAS benchmarks, HPL benchmark, and a real-world application called GROMACS. Experimental results indicate that in our design, the overhead to take checkpoints is low, and the performance impact for checkpointing applications periodically is insignificant. For example, time for checkpointing GROMACS is less than 0.3% of the execution time, and its performance only decreases by 4% with checkpoints taken every minute. To the best of our knowledge, this work is the first report of checkpoint/restart support for MPI over InfiniBand clusters in the literature.