Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints
IEEE Transactions on Computers
Performance Analysis of Two Time-Based Coordinated Checkpointing Protocols
PRFTS '97 Proceedings of the 1997 Pacific Rim International Symposium on Fault-Tolerant Systems
On Low-Cost Error Containment and Recovery Methods for Guarded Software Upgrading
ICDCS '00 Proceedings of the The 20th International Conference on Distributed Computing Systems ( ICDCS 2000)
Using Time to Improve the Performance of Coordinated Checkpointing
IPDS '96 Proceedings of the 2nd International Computer Performance and Dependability Symposium (IPDS '96)
Hi-index | 0.00 |
Abstract: This paper describes an approach for enabling the synergistic coordination between two fault tolerance protocols to simultaneously tolerate software and hardware faults in a distributed computing environment. Specifically, our approach is based on a message-driven confidence-driven (MDCD) protocol that we have devised for tolerating software design faults, and a time-based (TB) checkpointing protocol that was developed by Neves and Fuchs for tolerating hardware faults. By carrying out algorithm modifications that are conducive to synergistic coordination between volatile-storage and stable-storage checkpoint establishments, we are able to circumvent the potential interference between the MDCD and TB protocols, and to allow them to effectively complement each other to extend a system's fault tolerance capability. Moreover, the protocol-coordination approach preserves and enhances the features and advantages of the individual protocols that participate in the coordination, keeping the performance cost low.