Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance

Authors:
Jose Carlos Sancho;Fabrizio Petrini;Kei Davis;Roberto Gioiosa;Song Jiang
Affiliations:
Los Alamos National Laboratory, NM;Los Alamos National Laboratory, NM;Los Alamos National Laboratory, NM;Los Alamos National Laboratory, NM;Los Alamos National Laboratory, NM
Venue:
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 18 - Volume 19
Year:
2005

Abstract

Checkpoint/restart is a general technique whose particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hibernation, and fault tolerance. For fault tolerance, current implementations operate at either user level or system level. User-level implementations are portable and relatively easy to implement, but suffer from a lack of transparency, flexibility, and efficiency, and in particular are unsuitable for the autonomic (self-managing) computing systems envisioned as the next revolutionary development in system management. In contrast, a system-level implementation can exhibit all of these desirable features, at the cost of a more sophisticated implementation, and is seen as an essential mechanism for the next generation of fault-tolerant, and ultimately autonomic, large-scale computing systems. Linux is becoming the operating system of choice for the largest-scale machines, but development of system-level checkpoint/restart mechanisms for Linux is still in its infancy, with all extant implementations exhibiting serious deficiencies for achieving transparent fault tolerance. This paper provides a survey of extant implementations in a natural taxonomy, highlighting their strengths and inherent weaknesses.