FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI

Authors:
Gengbin Zheng; Lixia Shi; L. V. Kale
Affiliations:
Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA;Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA;Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
Venue:
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Year:
2004

Abstract

As high performance clusters continue to grow in size, the mean time between failures shrinks. Fault tolerance and reliability are therefore becoming a limiting factor for application scalability. The traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the most recent checkpoint. Recovering the application from a fault involves restarting it (often manually) on all processors and having every processor read its data back from disk. The restart can therefore take minutes after it has been initiated. Such a strategy also requires that the failed processor be replaced, so that the number of processors is the same at checkpoint time and at recovery time. We present FTC-Charm++, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. At restart, when there is no extra processor, the program can continue to run on the remaining processors while minimizing the performance penalty due to the lost processors. The method is useful for applications whose memory footprint is small at the checkpoint state, while a variation of this scheme, in-disk checkpoint/restart, can be applied to applications with a large memory footprint. The scheme does not require any individual component to be fault-free. We have implemented this scheme for Charm++ and AMPI (an adaptive version of MPI). This work describes the scheme and shows performance data on a cluster using 128 processors.
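The idea of in-memory checkpointing with restart on fewer processors can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the FTC-Charm++ implementation or any Charm++ API: the abstract does not say where checkpoint copies are placed, so the sketch assumes each processor keeps one copy locally and one on a "buddy" (the next rank), and it handles a single failure. All names (`Processor`, `checkpoint`, `recover`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """A simulated processor holding live objects and in-memory checkpoint copies."""
    rank: int
    objects: dict = field(default_factory=dict)     # live state: object id -> value
    local_ckpt: dict = field(default_factory=dict)  # checkpoint of this rank's objects
    buddy_ckpt: dict = field(default_factory=dict)  # checkpoint held for the previous rank
    alive: bool = True


def checkpoint(procs):
    """Store each processor's state in its own memory and in its buddy's memory.

    Assumption: the buddy of rank r is rank (r + 1) mod n.
    """
    n = len(procs)
    for p in procs:
        p.local_ckpt = dict(p.objects)
        procs[(p.rank + 1) % n].buddy_ckpt = dict(p.objects)


def recover(procs, failed_rank):
    """Roll survivors back to the checkpoint and adopt the failed rank's objects.

    Assumes a single failure, so the failed rank's buddy is still alive.
    """
    n = len(procs)
    procs[failed_rank].alive = False
    lost = procs[(failed_rank + 1) % n].buddy_ckpt  # surviving copy of the lost state
    survivors = [p for p in procs if p.alive]
    for p in survivors:
        p.objects = dict(p.local_ckpt)              # discard work done since the checkpoint
    # Spread the lost objects round-robin so no single survivor takes the whole load.
    for i, (obj_id, value) in enumerate(sorted(lost.items())):
        survivors[i % len(survivors)].objects[obj_id] = value
    return survivors


# Example: 4 processors with 2 objects each; rank 2 fails after a checkpoint.
procs = [Processor(rank=r, objects={f"obj{r}-{k}": 0 for k in range(2)}) for r in range(4)]
checkpoint(procs)
survivors = recover(procs, failed_rank=2)
```

No disk is read during recovery, and the run continues on three processors instead of waiting for a replacement. A real runtime would also have to rebalance load and re-establish checkpoint copies on the reduced processor set, which this sketch omits.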