A classical way to determine consistent snapshots is the Chandy-Lamport algorithm. It relies on dedicated control messages that let processes synchronize local checkpointing and message recording so that the resulting snapshot is consistent. This paper investigates a communication-induced approach to determining consistent snapshots, in which control information is piggybacked on application messages. Two abstract necessary and sufficient conditions are stated: one for global checkpoint consistency, the other for message recording. A general protocol is derived from these conditions; it can be instantiated in distinct ways, giving rise to a family of communication-induced snapshot protocols. The general protocol also shows that there is an intrinsic trade-off between the number of forced checkpoints and the number of recorded messages. Finally, a particular instantiation of the general protocol is provided.
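To make the idea of piggybacked control information concrete, here is a minimal sketch of a simple index-based scheme. It is not the paper's general protocol or its particular instantiation; it assumes each process keeps a snapshot index `sn`, attaches it to every application message, takes a forced checkpoint before delivering a message that carries a higher index, and records a message that carries a lower index as in transit. All names (`Message`, `Process`, `take_checkpoint`, `recorded`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # snapshot index piggybacked by the sender (assumed control information)


@dataclass
class Process:
    name: str
    sn: int = 0                                      # index of the latest local checkpoint
    state: list = field(default_factory=list)        # delivered payloads stand in for local state
    checkpoints: dict = field(default_factory=dict)  # snapshot index -> saved copy of state
    recorded: dict = field(default_factory=dict)     # snapshot index -> in-transit payloads

    def take_checkpoint(self, index=None):
        # A basic checkpoint advances the index by one; a forced one jumps to `index`.
        self.sn = self.sn + 1 if index is None else index
        self.checkpoints[self.sn] = list(self.state)

    def send(self, payload):
        # No separate control message: the index travels on the application message.
        return Message(payload, self.sn)

    def receive(self, msg):
        if msg.sn > self.sn:
            # The sender already belongs to a later snapshot: checkpoint before
            # delivery so that the message cannot become an orphan.
            self.take_checkpoint(msg.sn)
        elif msg.sn < self.sn:
            # Sent before the sender's checkpoints msg.sn+1..self.sn but delivered
            # after ours: the message is in transit for those snapshots.
            for k in range(msg.sn + 1, self.sn + 1):
                self.recorded.setdefault(k, []).append(msg.payload)
        self.state.append(msg.payload)


# Usage on a non-FIFO channel: "a" is sent first but delivered last.
p, q = Process("p"), Process("q")
m1 = p.send("a")      # carries index 0
p.take_checkpoint()   # p starts snapshot 1 on its own initiative
m2 = p.send("b")      # carries index 1
q.receive(m2)         # forces q to take checkpoint 1 before delivery
q.receive(m1)         # recorded as in transit for snapshot 1
```

In this run, `q` takes one forced checkpoint and records one message, which gives a small instance of the two costs that the abstract says are traded against each other. Under the stated assumptions, snapshot 1 consists of both processes' checkpoints with index 1 together with the recorded message `"a"`.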