Use of Common Time Base for Checkpointing and Rollback Recovery in a Distributed System

Authors:
Parameswaran Ramanathan; Kang G. Shin
Affiliations:
Univ. of Wisconsin-Madison, Madison; The Univ. of Michigan, Ann Arbor
Venue:
IEEE Transactions on Software Engineering
Year:
1993

Citing 10

Using Time Instead of Timeout for Fault-Tolerant Distributed Systems
ACM Transactions on Programming Languages and Systems (TOPLAS)

Ensuring Fault Tolerance of Phase-Locked Clocks
IEEE Transactions on Computers

On the minimal synchronism needed for distributed consensus
Journal of the ACM (JACM)

Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems

Clock synchronization of a large multiprocessor system in the presence of malicious faults
IEEE Transactions on Computers

Adding time to synchronous process communications
IEEE Transactions on Computers - Special Issue on Real-Time Systems

Optimal checkpointing of real-time tasks
IEEE Transactions on Computers

Impossibility of distributed consensus with one faulty process
Journal of the ACM (JACM)

Reliability Issues in Computing System Design
ACM Computing Surveys (CSUR)

A program structure for error detection and recovery
Operating Systems, Proceedings of an International Symposium

Cited 9

On Coordinated Checkpointing in Distributed Systems
IEEE Transactions on Parallel and Distributed Systems

Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing Systems
IEEE Transactions on Parallel and Distributed Systems

Checkpointing with mutable checkpoints
Theoretical Computer Science - Dependable computing

Validating Requirements for Fault Tolerant Systems using Model Checking
ICRE '98 Proceedings of the 3rd International Conference on Requirements Engineering: Putting Requirements Engineering to Practice

Fault propagation analysis based variable length checkpoint placement for fault-tolerant parallel and distributed systems
COMPSAC '97 Proceedings of the 21st International Computer Software and Applications Conference

Finding a Recovery Line in Uncoordinated Checkpointing
ICDCSW '04 Proceedings of the 24th International Conference on Distributed Computing Systems Workshops - W7: EC (ICDCSW'04) - Volume 7

A novel min-process checkpointing scheme for mobile computing systems
Journal of Systems Architecture: the EUROMICRO Journal

An efficient protocol for checkpoint-based failure recovery in distributed systems
ICDCIT'04 Proceedings of the First international conference on Distributed Computing and Internet Technology

Performance evaluation of cloud service considering fault recovery
The Journal of Supercomputing

Abstract

An approach to checkpointing and rollback recovery in a distributed computing system using a common time base is proposed. A common time base is established in the system using a hardware clock synchronization algorithm. This common time base is coupled with the idea of pseudo-recovery points to develop a checkpointing algorithm that has the following advantages: a shorter wait for commitment when establishing recovery lines, fewer messages exchanged, and lower memory requirements. These advantages are assessed quantitatively by developing a probabilistic model.