The Journal of Supercomputing
Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hibernation, and fault tolerance. For fault tolerance, in current practice, implementations can be at user level or system level. User-level implementations are relatively easy to build and are portable, but they suffer from a lack of transparency, flexibility, and efficiency, and in particular are unsuitable for the autonomic (self-managing) computing systems envisioned as the next revolutionary development in system management. In contrast, a system-level implementation can exhibit all of these desirable features, at the cost of a more sophisticated implementation, and is seen as an essential mechanism for the next generation of fault-tolerant, and ultimately autonomic, large-scale computing systems. Linux is becoming the operating system of choice for the largest-scale machines, but development of system-level checkpoint/restart mechanisms for Linux is still in its infancy, with all extant implementations exhibiting serious deficiencies with respect to transparent fault tolerance. This paper provides a survey of extant implementations organized in a natural taxonomy, highlighting their strengths and inherent weaknesses.