Application-Level checkpointing techniques for parallel programs

  • Authors: John Paul Walters; Vipin Chaudhary

  • Affiliations: Institute for Scientific Computing, Wayne State University; Department of Computer Science and Engineering, University at Buffalo, The State University of New York

  • Venue: ICDCIT'06 Proceedings of the Third International Conference on Distributed Computing and Internet Technology
  • Year: 2006

Abstract

In its simplest form, checkpointing is the act of saving a program's computation state in a form external to the running program, for example by writing that state to a filesystem. The checkpoint files can then be used to resume computation after the original process(es) fail, ideally with minimal loss of completed work. A checkpoint can be taken at any level of the system, using techniques that range from special hardware or architectural checkpointing features to modification of the user's source code. This survey discusses the techniques used in application-level checkpointing, with special attention to checkpointing parallel and distributed applications.
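As an illustration only, the idea of application-level checkpointing can be sketched for a single sequential process as follows. This is not a technique taken from the survey; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, the toy running-sum computation, and the `fail_at` parameter used to simulate a crash are all assumptions made for this example. The program periodically saves its own state to a file and, when restarted, resumes from the last saved state rather than from the beginning.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Replacing in one step means a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical knob that simulates a crash at a given step.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run fails partway through, calling `run` again with the same path repeats only the steps performed since the last checkpoint. Parallel and distributed programs additionally have to keep the checkpoints of different processes mutually consistent, which this single-process sketch does not attempt.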