Virtual-machine-based heterogeneous checkpointing

Authors:
Adnan Agbaria;Roy Friedman
Affiliations:
Computer Science Department, Technion -- Israel Institute of Technology, Haifa 32000, Israel;Computer Science Department, Technion -- Israel Institute of Technology, Haifa 32000, Israel
Venue:
Software—Practice & Experience
Year:
2002


Abstract

Checkpointing an application is the act of saving the application's state to stable storage during its execution, so that if the application fails it can be restarted from the last saved state, avoiding the loss of work already done. A heterogeneous checkpoint/restart mechanism allows an application to be restarted on a hardware architecture and/or operating system that may differ from the one on which it was checkpointed. This paper explores how to construct such a mechanism at the virtual machine level. That is, rather than dumping the entire state of the application process, the mechanism reported here dumps the state of the application as maintained by a virtual machine. During restart, the saved state is loaded into a new copy of the virtual machine, which continues running from there. The mechanism was developed for the OCaml variant of ML. The paper reports on the main issues encountered in building such a mechanism and the design choices made, presents performance evaluations, and discusses some lessons and ideas for extending the work to native-code OCaml and Java.
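The idea of checkpointing at the virtual machine level can be illustrated with a toy example. The sketch below is not the authors' OCaml mechanism; it is a minimal Python illustration, assuming a hypothetical stack machine (`ToyVM`) whose entire state is its program, program counter, and operand stack. Because that state is written in a machine-independent text format (JSON here, chosen only for the illustration), a fresh copy of the virtual machine on any platform could in principle load it and continue.

```python
import json


class ToyVM:
    """Hypothetical stack machine; its whole state is (program, pc, stack)."""

    def __init__(self, program, pc=0, stack=None):
        self.program = program          # list of instructions, e.g. ["push", 2]
        self.pc = pc                    # index of the next instruction to run
        self.stack = list(stack or [])  # operand stack

    def step(self):
        """Execute one instruction and advance the program counter."""
        op, *args = self.program[self.pc]
        if op == "push":
            self.stack.append(args[0])
        elif op == "add":
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(a + b)
        elif op == "mul":
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(a * b)
        else:
            raise ValueError(f"unknown instruction: {op}")
        self.pc += 1

    def run(self, max_steps=None):
        """Run until the program ends or max_steps instructions have executed."""
        steps = 0
        while self.pc < len(self.program):
            if max_steps is not None and steps >= max_steps:
                break
            self.step()
            steps += 1

    def checkpoint(self):
        """Dump the VM-level state (not the host process image) as portable text."""
        return json.dumps(
            {"program": self.program, "pc": self.pc, "stack": self.stack}
        )

    @classmethod
    def restart(cls, blob):
        """Load a saved state into a new copy of the VM."""
        state = json.loads(blob)
        return cls(state["program"], state["pc"], state["stack"])


# Computes (2 + 3) * 4.
PROGRAM = [["push", 2], ["push", 3], ["add"], ["push", 4], ["mul"]]

if __name__ == "__main__":
    vm = ToyVM(PROGRAM)
    vm.run(max_steps=3)            # run part of the program, then "fail"
    saved = vm.checkpoint()        # in practice this would go to stable storage
    resumed = ToyVM.restart(saved) # fresh VM, possibly on another machine
    resumed.run()
    print(resumed.stack)
```

A real virtual machine carries much more state than this, and the abstract does not say how the authors represent it, so the sketch only shows the general shape of saving and reloading interpreter state rather than a process image.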