User-Triggered Checkpointing: System-Independent and Scalable Application Recovery

  • Authors:
  • Geert Deconinck; Rudy Lauwereins

  • Venue:
  • Proceedings of the 2nd IEEE Symposium on Computers and Communications (ISCC '97)
  • Year:
  • 1997

Abstract

User-triggered checkpointing and rollback is proposed as a system-independent and flexible way to integrate backward error recovery into long-running, computation-intensive message-passing applications on large parallel multicomputers. It employs library calls to coordinate the checkpointing, which allows a non-blocking and scalable approach that requires no protocol to save a consistent state, because the coordination among the processes is implicit. Explicitly indicating the checkpoint contents (i.e., the items whose state must be saved) makes it possible to significantly reduce the amount of checkpoint data and the overhead. In contrast to other checkpointing approaches, the implementation does not rely on system-dependent features (such as saving register values or communication status) to save the state. Instead, re-executing the first part of the application brings the system-specific items into a state consistent with the rest of the checkpoint contents, which are restored from the saved checkpoint data.
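
The idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' library: the class `UserCheckpoint`, the function `run`, the `fail_at` parameter, and the checkpoint interval of 10 steps are all hypothetical names and values chosen for illustration. The sketch assumes that the application names the items to save explicitly and that recovery works by re-running the program from the start, so that only the named items are reloaded. It leaves out the message-passing side, where every process would reach the same library call at the same logical point in the program.

```python
import os
import pickle
import tempfile


class UserCheckpoint:
    """Hypothetical helper: saves only the items the user names explicitly."""

    def __init__(self, path):
        self.path = path

    def save(self, **items):
        # Write to a temporary file, then rename it atomically, so a crash
        # during the write never corrupts the previous checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(items, f)
        os.replace(tmp, self.path)

    def restore(self, **defaults):
        # Return the saved items, or the defaults when no checkpoint exists.
        if not os.path.exists(self.path):
            return dict(defaults)
        with open(self.path, "rb") as f:
            return pickle.load(f)


def run(ckpt, n_steps, fail_at=None):
    # This first part runs on a fresh start and after a rollback alike.
    # In a real application it would rebuild system-specific state here
    # (open files, communication setup), which is never written to disk.
    state = ckpt.restore(step=0, total=0)
    step, total = state["step"], state["total"]

    while step < n_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated fault")  # assumption: fault injection
        total += step * step
        step += 1
        if step % 10 == 0:
            # User-triggered checkpoint: only 'step' and 'total' are saved.
            ckpt.save(step=step, total=total)
    return total
```

Calling `run` a second time with the same `UserCheckpoint` after a simulated fault resumes from the last saved step instead of from step 0. Because the checkpoint holds only the two named values, it stays small regardless of how much other state the process carries.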