Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++

Authors:
Gengbin Zheng;Chao Huang;Laxmikant V. Kalé
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
ACM SIGOPS Operating Systems Review
Year:
2006

Abstract

As high-performance clusters grow in size, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are an effective way to deal with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage, and when a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. They are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ runtime systems and can be applied to a wide class of applications written using MPI or message-driven languages. We demonstrate the effectiveness of the strategies and evaluate their performance.
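
The idea of self-checkpointable, migratable objects that can be restarted on a different number of processors can be illustrated with a small sketch. This is not the AMPI or Charm++ API: the `Worker` class, the `pack`/`unpack` methods, and the `place`, `checkpoint`, `restart`, and `run_demo` helpers are hypothetical names chosen for illustration, processors are modelled as plain Python lists, and an in-memory list of byte strings stands in for reliable storage.

```python
import pickle


class Worker:
    """Hypothetical migratable object; all of its state lives in plain attributes."""

    def __init__(self, obj_id):
        self.obj_id = obj_id
        self.value = 0
        self.step = 0

    def advance(self):
        # One unit of application work.
        self.value += self.obj_id
        self.step += 1

    def pack(self):
        # Serialize the object's own state (the "self-checkpointable" part).
        return pickle.dumps(self.__dict__)

    @classmethod
    def unpack(cls, blob):
        # Rebuild an object from a serialized state without rerunning __init__.
        obj = cls.__new__(cls)
        obj.__dict__.update(pickle.loads(blob))
        return obj


def place(objects, num_procs):
    """Map objects onto simulated processors in round-robin order (an assumption)."""
    procs = [[] for _ in range(num_procs)]
    for obj in objects:
        procs[obj.obj_id % num_procs].append(obj)
    return procs


def checkpoint(procs):
    """Collect the packed state of every object on every simulated processor."""
    return [obj.pack() for proc in procs for obj in proc]


def restart(blobs, num_procs):
    """Recreate all objects from a checkpoint on a possibly different processor count."""
    return place([Worker.unpack(blob) for blob in blobs], num_procs)


def run_demo():
    procs = place([Worker(i) for i in range(8)], num_procs=4)
    for _ in range(3):
        for proc in procs:
            for obj in proc:
                obj.advance()
    saved = checkpoint(procs)  # checkpoint taken after 3 steps

    # Work done after the checkpoint is lost when the simulated fault occurs.
    for proc in procs:
        for obj in proc:
            obj.advance()

    # Restart from the checkpoint on 3 processors instead of 4.
    return restart(saved, num_procs=3)


if __name__ == "__main__":
    for rank, proc in enumerate(run_demo()):
        print(rank, [(obj.obj_id, obj.step, obj.value) for obj in proc])
```

Because the unit of checkpointing in this sketch is the object rather than the processor, the restart step only needs a new object-to-processor mapping, which is why the processor count can change between checkpoint and restart.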