Virtual Machine Based Heterogeneous Checkpointing

  • Authors:
  • Adnan Agbaria; Roy Friedman

  • Venue:
  • IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
  • Year:
  • 2002

Abstract

Checkpointing an application means saving its state to stable storage during execution, so that if the application fails it can be restarted from the last saved state rather than losing the work already done. A heterogeneous checkpoint/restart mechanism allows an application to be restarted from a saved state that was taken on a hardware architecture and/or operating system that may differ from those of the machine on which it is restarted. This paper explores how to construct such a mechanism at the virtual machine level. That is, rather than dumping the entire state of the application process, the mechanism reported here dumps the state of the application with respect to a virtual machine. During restart, the saved state is loaded into a new copy of the virtual machine, which continues running from that point. The mechanism was developed for the OCaml variant of ML. The paper reports on the main issues encountered in building such a mechanism and the design choices made, presents performance evaluations, and discusses lessons learned and ideas for extending the work to native-code OCaml and to Java Virtual Machines.
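
The idea of checkpointing with respect to a virtual machine, rather than dumping the whole process image, can be illustrated with a toy example. The sketch below is not the paper's OCaml mechanism: `ToyVM`, its two opcodes, and the choice of JSON as a platform-neutral format are all assumptions made for illustration. It only shows that when the saved state consists of VM-level values (program, program counter, stack), a fresh copy of the VM on any platform can load that state and continue.

```python
import json


class ToyVM:
    """Hypothetical stack machine whose whole state is (program, pc, stack)."""

    def __init__(self, program, pc=0, stack=None):
        self.program = program
        self.pc = pc
        self.stack = list(stack) if stack is not None else []

    def done(self):
        return self.pc >= len(self.program)

    def step(self):
        # Execute one instruction and advance the program counter.
        op, arg = self.program[self.pc]
        if op == "push":
            self.stack.append(arg)
        elif op == "add":
            b = self.stack.pop()
            a = self.stack.pop()
            self.stack.append(a + b)
        else:
            raise ValueError(f"unknown opcode: {op}")
        self.pc += 1

    def run(self):
        # Run until the program ends and return the final stack.
        while not self.done():
            self.step()
        return self.stack

    def checkpoint(self):
        # Save only VM-level state, as text, with no pointers or
        # machine-specific memory layout (assumed format: JSON).
        return json.dumps(
            {"program": self.program, "pc": self.pc, "stack": self.stack}
        )

    @classmethod
    def restart(cls, saved):
        # Load the saved state into a new copy of the VM.
        state = json.loads(saved)
        return cls(state["program"], state["pc"], state["stack"])


program = [["push", 2], ["push", 3], ["add", None], ["push", 10], ["add", None]]

vm = ToyVM(program)
vm.step()
vm.step()                        # stop partway through execution
saved = vm.checkpoint()          # this string could be moved to another machine

resumed = ToyVM.restart(saved)   # a fresh VM instance, possibly elsewhere
print(resumed.run())             # [15]
```

A real virtual machine has far more state to capture (heap, code pointers, open channels, and so on), and handling that state is where the issues the paper reports on arise. The toy version sidesteps them by keeping every value representable in JSON.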