Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs

Authors:
Martin Schulz;Greg Bronevetsky;Rohit Fernandes;Daniel Marques;Keshav Pingali;Paul Stodghill
Affiliations:
Lawrence Livermore National Laboratory;Cornell University;Cornell University;Cornell University;Cornell University;Cornell University
Venue:
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Year:
2004

Abstract

The running times of many computational science applications are much longer than the mean-time-to-failure of current high-performance computing platforms. To run to completion, such applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this: the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform; in addition, it cannot be used if there are no global barriers in the program. We are exploring an alternative called application-level, non-blocking checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform; there is also no assumption about the existence of global barriers in the code. In this paper, we describe our implementation of application-level, non-blocking checkpointing. We present experimental results on both a Windows cluster and a Compaq Alpha cluster, which show that the overheads introduced by our approach are small.
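To make the basic checkpoint-and-restart idea in the abstract concrete, the following is a minimal single-process sketch in Python. It is not the paper's pre-processor output and does not model the non-blocking coordination among MPI processes; it only illustrates a program that saves its own state periodically and resumes from the most recently saved state. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter, and the use of `pickle` on a local file as "stable storage" are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # A crash before this point leaves the previous checkpoint intact.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recently saved state, or None if none exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `checkpoint_every` steps.

    `fail_at` (hypothetical, for illustration) simulates a hardware
    failure just before the given step executes.
    """
    # Restart from the saved state if there is one, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` once with `fail_at` set and then again without it resumes from the last saved step rather than from step 0. Work done after the last checkpoint is lost and recomputed, which is the trade-off that the checkpoint interval controls.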