Checkpoint/restart is the ability to save the state of a running application so that it can later resume execution from the point at which the checkpoint was taken. The technique has many potential applications, including establishing a fault-tolerant environment, improving system resource utilization, and true process migration. As hardware speeds and cluster sizes increase, the average time between failures decreases, so fault tolerance and the ability to checkpoint a process have become indispensable. Almost all platforms deployed for high-performance computing support process checkpoint/restart. Linux, although one of the most popular operating systems, does not provide a general-purpose implementation. Existing implementations for Linux are limited to a specific type of parallel programming library, confined to certain well-behaved kinds of applications, or reliant on specific kernel features that are often unavailable. Most of them also require recompiling the entire kernel to apply the necessary patches. In this paper, we describe the design and implementation of a multithreaded process checkpoint/restart system for Linux that can be extended dynamically to increase compatibility and reduce system overhead. It does not require any special facility in the operating system, and it can checkpoint and restart applications fully transparently, regardless of their behavior. The entire system is implemented as a set of loadable kernel modules, which makes it easy to use and removes the burden of complex system administration.