Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Detection of stable properties in distributed applications
PODC '87 Proceedings of the sixth annual ACM Symposium on Principles of distributed computing
Recovery in distributed systems using optimistic message logging and checkpointing
Journal of Algorithms
Stabilizing Communication Protocols
IEEE Transactions on Computers - Special issue on protocol engineering
Synthesis of Communication Protocols: Survey and Assessment
IEEE Transactions on Computers - Special issue on protocol engineering
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Introduction to Program Fault Tolerance
Fault Tolerance: Principles and Practice
This paper assesses the use of Chandy and Lamport's distributed snapshots algorithm (DSA) for stabilizing a communication protocol, a special type of distributed system. We show that when a loss of coordination occurs during the distributed execution of the protocol, DSA is not guaranteed to terminate, and therefore it sometimes fails to obtain a global state or snapshot. We propose some modifications to DSA to solve this problem. Finally, we discuss how, in the case of a loss of coordination, the modified algorithm can be used to stabilize a communication protocol, and we assess the suitability of the global state obtained by DSA as a recovery point to be used later in a backward recovery procedure.
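The termination issue can be illustrated with a minimal two-process sketch of marker-based snapshot recording in the style of Chandy and Lamport. Everything named here (`Process`, `Network`, `run_snapshot`, the `drop_markers` flag) is hypothetical and introduced only for illustration. The failure is modeled as the loss of marker messages, which is one simple assumption about how coordination might be lost and not necessarily the scenario the paper analyzes. The sketch does not implement the paper's modifications.

```python
from collections import deque

MARKER = "MARKER"  # control message that delimits the snapshot on a channel


class Process:
    def __init__(self, name, peers):
        self.name = name
        self.peers = peers
        self.state = 0                 # application state: a running sum
        self.recorded_state = None     # local state saved for the snapshot
        self.open_channels = set()     # incoming channels still being recorded
        self.channel_log = {p: [] for p in peers}  # recorded in-flight messages

    def start_recording(self, network, skip=None):
        # Save the local state, then send a marker on every outgoing channel.
        self.recorded_state = self.state
        self.open_channels = {p for p in self.peers if p != skip}
        for p in self.peers:
            network.send(self.name, p, MARKER)

    def receive(self, sender, msg, network):
        if msg == MARKER:
            if self.recorded_state is None:
                # First marker: record state; the sender's channel is empty.
                self.start_recording(network, skip=sender)
            else:
                # Later marker: stop recording the sender's channel.
                self.open_channels.discard(sender)
        else:
            self.state += msg
            if self.recorded_state is not None and sender in self.open_channels:
                self.channel_log[sender].append(msg)

    def done(self):
        return self.recorded_state is not None and not self.open_channels


class Network:
    def __init__(self, names, drop_markers=False):
        # One FIFO queue per directed channel.
        self.queues = {(a, b): deque() for a in names for b in names if a != b}
        self.drop_markers = drop_markers  # assumption: models the failure

    def send(self, src, dst, msg):
        if msg == MARKER and self.drop_markers:
            return  # the marker is lost and never delivered
        self.queues[(src, dst)].append(msg)

    def deliver_all(self, procs):
        # Deliver messages until every channel is empty.
        progress = True
        while progress:
            progress = False
            for (src, dst), queue in self.queues.items():
                if queue:
                    procs[dst].receive(src, queue.popleft(), self)
                    progress = True


def run_snapshot(drop_markers):
    names = ["P", "Q"]
    net = Network(names, drop_markers)
    procs = {n: Process(n, [m for m in names if m != n]) for n in names}
    net.send("Q", "P", 5)             # application message in flight
    procs["P"].start_recording(net)   # P initiates the snapshot
    net.deliver_all(procs)
    return all(p.done() for p in procs.values()), procs
```

With reliable channels, `run_snapshot(False)` reports completion, and P's record holds its own state together with the in-flight message from Q. With `run_snapshot(True)`, the network goes quiet while P is still waiting for a marker that will never arrive, so the snapshot never completes.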