User-Triggered Checkpointing: System-Independent and Scalable Application Recovery

Authors:
Geert Deconinck; Rudy Lauwereins
Venue:
Proceedings of the 2nd IEEE Symposium on Computers and Communications (ISCC '97)
Year:
1997

Abstract

User-triggered checkpointing and rollback is proposed as a system-independent and flexible way to integrate backward error recovery into long-running, computation-intensive message-passing applications on large parallel multicomputers. It uses library calls to coordinate checkpointing, which yields a non-blocking, scalable approach that needs no protocol to save a consistent state, because the coordination among the processes is implicit. Explicitly indicating the checkpoint contents (i.e., the items whose state must be saved) significantly reduces the amount of checkpoint data and the overhead. In contrast to other checkpointing approaches, the implementation does not rely on system-dependent features (such as saving register values or communication status) to save the state. Instead, re-executing the first part of the application brings the system-specific items into a state consistent with the rest of the checkpoint contents, which are restored from the saved checkpoint data.
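
The idea in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' library: the `Checkpointer` class, its `save` and `restore` methods, the `run` function, and the checkpoint interval of 10 steps are all hypothetical names and values chosen for illustration. The sketch shows only two of the points above: the application names exactly which items go into the checkpoint, and after a failure the first part of the program is simply executed again before the saved items are restored. Coordination among multiple processes is not modeled.

```python
import os
import pickle


class Checkpointer:
    """Hypothetical user-triggered checkpoint helper (illustrative only)."""

    def __init__(self, path):
        self.path = path
        self.items = {}  # items the application declared as checkpoint contents

    def restore(self):
        """Load saved items if a checkpoint exists; return True when recovering."""
        if not os.path.exists(self.path):
            return False
        with open(self.path, "rb") as f:
            self.items = pickle.load(f)
        return True

    def save(self, **items):
        """Save only the explicitly named items, not the whole process state."""
        self.items = dict(items)
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(self.items, f)
        os.replace(tmp, self.path)  # rename so a crash never leaves a partial file


def run(path, n_steps, fail_at=None):
    # First part of the application. It is re-executed on every start, including
    # after a failure, so system-specific state is rebuilt rather than saved.
    # The dict below is a stand-in for such state (assumption for illustration).
    runtime = {"channels": "open"}

    ckpt = Checkpointer(path)
    if ckpt.restore():
        # Recovery: take the application-level items from the checkpoint.
        step, total = ckpt.items["step"], ckpt.items["total"]
    else:
        step, total = 0, 0

    while step < n_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated failure")
        total += step
        step += 1
        if step % 10 == 0:
            # User-triggered checkpoint: the application decides when to save
            # and which items make up the checkpoint contents.
            ckpt.save(step=step, total=total)
    return total
```

Calling `run` once with a `fail_at` value and then again on the same path without it resumes from the last saved `step` and `total` instead of starting over. Because only two integers are written, the checkpoint stays small regardless of how much other state the process holds.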