An approach to checkpointing and rollback recovery in a distributed computing system using a common time base is proposed. The common time base is established by a hardware clock synchronization algorithm. It is then combined with the idea of pseudo-recovery points to develop a checkpointing algorithm with the following advantages: a shorter wait for commitment when establishing recovery lines, fewer messages exchanged, and lower memory requirements. These advantages are assessed quantitatively with a probabilistic model.
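To make the general idea of checkpointing against a common time base concrete, here is a minimal Python sketch. It is not the algorithm from the paper: the abstract does not give its details, so the class `Process`, the function `recovery_line`, the fixed checkpoint `period`, and the rule "one checkpoint per interval index" are all illustrative assumptions. The sketch also ignores clock skew, in-transit messages, and pseudo-recovery points. It only shows why a shared clock can reduce coordination: each process decides locally when to checkpoint, and checkpoints with the same interval index form a candidate recovery line without extra messages.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process that checkpoints on a shared clock (illustration only)."""
    name: str
    period: float                                     # assumed checkpoint interval on the common time base
    state: int = 0                                    # stand-in for the process state
    checkpoints: dict = field(default_factory=dict)   # interval index -> saved state

    def tick(self, common_time: float) -> None:
        """Save a checkpoint for the current interval if none exists yet."""
        k = int(common_time // self.period)
        if k not in self.checkpoints:
            self.checkpoints[k] = self.state


def recovery_line(processes):
    """Return the latest interval index for which every process holds a checkpoint."""
    if not processes:
        return None
    common = set.intersection(*(set(p.checkpoints) for p in processes))
    return max(common) if common else None


if __name__ == "__main__":
    p = Process("p", period=10.0)
    q = Process("q", period=10.0)
    p.tick(0.0)
    q.tick(0.5)
    p.state = 5
    p.tick(10.2)                    # p enters interval 1 first
    print(recovery_line([p, q]))    # q has not checkpointed interval 1 yet
    q.tick(10.4)
    print(recovery_line([p, q]))
```

In this sketch the interval index derived from the shared clock stands in for the agreement that a message-based protocol would otherwise have to negotiate. A real scheme would also have to bound clock skew and account for messages that cross a checkpoint boundary.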