This paper considers a communication system consisting of many processors and studies how to improve its reliability through the recovery techniques of checkpointing and rollback. When a processor failure or a communication error occurs, the processors associated with the event roll back to their most recent checkpoints, so that a consistent state is maintained across the whole system. A stochastic model of these recovery techniques is formulated using the theory of Markov renewal processes. The mean time to take a checkpoint and the expected numbers of rollback recoveries caused by processor failures and communication errors are derived. Further, an optimal checkpointing interval that minimizes the expected cost is discussed analytically.
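
To illustrate the trade-off behind an optimal checkpointing interval, the following is a minimal sketch under much simpler assumptions than the Markov renewal model described above. It assumes a single processor, failures arriving as a Poisson process with rate `fail_rate`, a fixed checkpoint cost `ckpt_cost`, and a restart from the last checkpoint with no extra recovery delay. Communication errors and multi-processor rollback are not modeled. The function names and parameters are hypothetical, and the closed-form comparison uses Young's well-known first-order approximation, not a result from this paper.

```python
import math


def expected_segment_time(work, ckpt_cost, fail_rate):
    """Expected wall-clock time to finish `work` units of computation plus
    one checkpoint, when any failure restarts the segment from its start.
    Assumes exponentially distributed times between failures."""
    return math.expm1(fail_rate * (work + ckpt_cost)) / fail_rate


def overhead_ratio(work, ckpt_cost, fail_rate):
    """Expected wall-clock time spent per unit of useful work."""
    return expected_segment_time(work, ckpt_cost, fail_rate) / work


def optimal_interval(ckpt_cost, fail_rate, tol=1e-6):
    """Golden-section search for the interval that minimizes overhead_ratio.
    The search bracket (0, 10 / fail_rate] is an assumption of this sketch."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = 1e-6, 10.0 / fail_rate
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if overhead_ratio(c, ckpt_cost, fail_rate) < overhead_ratio(d, ckpt_cost, fail_rate):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


def young_interval(ckpt_cost, fail_rate):
    """Young's first-order approximation: sqrt(2 * C / lambda)."""
    return math.sqrt(2.0 * ckpt_cost / fail_rate)


if __name__ == "__main__":
    C, lam = 1.0, 0.001  # illustrative values only
    print("numerical optimum:", optimal_interval(C, lam))
    print("Young's estimate: ", young_interval(C, lam))
```

Short intervals waste time on checkpoints, while long intervals lose more work per failure; the search locates the balance point for this simplified cost. When the checkpoint cost is small relative to the mean time between failures, the numerical optimum and Young's estimate should be close.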