An Efficient Checkpointing Algorithm for Distributed Systems Implementing Reliable Communication Channels

Authors:
Eugene Gendelman; Lubomir F. Bic; Michael B. Dillencourt
Venue:
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Year:
1999

Abstract

This paper presents a new checkpointing algorithm that guarantees the semantics of reliable communication channels despite the crash and recovery of processes. The algorithm requires O(n + m) messages, where n is the number of participating processes and m is the number of "late" messages. It is non-blocking, requires minimal message logging, and has minimal stable storage requirements. It is also scalable, simple, transparent to the user, and facilitates fast recovery. By introducing a suitable delay in the checkpointing process, the parameter m can be made small. We also describe a variant of the algorithm that requires only O(n) messages, at the cost of O(n) additional storage for each process.
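
The abstract does not define a "late" message or give the constants hidden in O(n + m), so the following is only an illustrative sketch, not the paper's algorithm. It assumes a common reading from the checkpointing literature: a message is late if it was sent before the sender's checkpoint but delivered after the receiver's checkpoint. The names `Message`, `is_late`, `late_messages`, and `message_bound`, and the use of per-process checkpoint counters ("epochs"), are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    sent_epoch: int      # sender's checkpoint count when the message was sent
    received_epoch: int  # receiver's checkpoint count when it was delivered


def is_late(msg: Message) -> bool:
    """Assumed definition: sent before the sender's checkpoint,
    delivered after the receiver's checkpoint."""
    return msg.sent_epoch < msg.received_epoch


def late_messages(messages):
    """Return the messages that count toward m under the assumed definition."""
    return [m for m in messages if is_late(m)]


def message_bound(n: int, m: int) -> int:
    """Illustrative n + m term only; the abstract states O(n + m),
    so the real constant factors are unknown."""
    return n + m


# Hypothetical trace over three processes and one checkpoint round.
trace = [
    Message("p0", "p1", sent_epoch=0, received_epoch=0),  # delivered before checkpoint
    Message("p1", "p2", sent_epoch=0, received_epoch=1),  # late
    Message("p2", "p0", sent_epoch=1, received_epoch=1),  # sent after checkpoint
    Message("p0", "p2", sent_epoch=0, received_epoch=1),  # late
]

late = late_messages(trace)
print(len(late), message_bound(3, len(late)))  # prints "2 5"
```

Under this reading, delaying the checkpoint on each process gives in-flight messages time to arrive before the receiver checkpoints, which is one way to see why a suitable delay can shrink m.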