Checkpointing an application is the act of saving the application's state to stable storage during its execution, so that if the application fails it can be restarted from the last saved state, thereby avoiding loss of the work already done. A heterogeneous checkpoint/restart mechanism allows an application to be restarted on a hardware architecture and/or operating system that may differ from the one on which it was checkpointed. This paper explores how to construct such a mechanism at the virtual machine level. That is, rather than dumping the entire state of the application process, the mechanism reported here dumps the state of the application as maintained by a virtual machine. During restart, the saved state is loaded into a new copy of the virtual machine, which resumes execution from that point. The heterogeneous checkpoint/restart mechanism reported here was developed for the OCaml variant of ML. The paper reports on the main issues encountered in building such a mechanism and the design choices made, presents performance evaluations, and discusses lessons learned and ideas for extending the work to native-code OCaml and Java.