Communication-induced checkpointing (CIC) allows processes in a distributed computation to take independent checkpoints while still avoiding the domino effect. This paper presents an analysis of CIC protocols based on a prototype implementation and validated simulations. Our results indicate that there is sufficient evidence to suspect that much of the conventional wisdom about these protocols is questionable.