In distributed industrial control systems, a certain level of reliability must be guaranteed. Checkpointing and rollback techniques offer an attractive way to achieve fault tolerance without an appreciable increase in cost or complexity. Several checkpointing techniques have been proposed, and most of them assume that the system provides stable storage. Distributed industrial control systems, however, usually lack this kind of storage, so a different storage strategy is needed. If checkpoints are stored only locally, on the node that takes them (Simple Checkpointing), the system tolerates only transient faults. If each checkpoint is stored locally and, additionally, on one or more other nodes of the system (Two-level Checkpointing), the system can also recover from some permanent faults. This article presents the results of a study of the reliability of these two checkpoint storage strategies, in order to evaluate whether the reliability gain of the Two-level method justifies its greater complexity. For this study, two distributed industrial control systems are presented. Each is based on a different node architecture, which has an important effect on the results of the study.
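The difference between the two storage strategies can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the reliability model used in the study: `Cluster`, `FaultKind`, and `buddy` are hypothetical names, each node in Two-level mode copies its checkpoint to exactly one neighbouring node in a ring, a transient fault is assumed to leave the node's local checkpoint memory intact, and a permanent fault is assumed to destroy the node together with everything it stores.

```python
from dataclasses import dataclass, field
from enum import Enum


class FaultKind(Enum):
    # Assumption: after a transient fault the node restarts and its local checkpoints survive.
    TRANSIENT = "transient"
    # Assumption: a permanent fault loses the node and all checkpoints stored on it.
    PERMANENT = "permanent"


@dataclass
class Node:
    name: str
    alive: bool = True
    # Checkpoints held on this node, keyed by the name of the node that produced them.
    store: dict = field(default_factory=dict)


class Cluster:
    """Hypothetical model comparing Simple and Two-level checkpoint storage."""

    def __init__(self, names, two_level):
        self.order = list(names)
        self.nodes = {n: Node(n) for n in self.order}
        self.two_level = two_level

    def buddy(self, name):
        # Assumption: the remote copy goes to the next node in a fixed ring.
        i = self.order.index(name)
        return self.order[(i + 1) % len(self.order)]

    def checkpoint(self, name, state):
        # Simple Checkpointing: store the checkpoint only on the local node.
        self.nodes[name].store[name] = state
        # Two-level Checkpointing: additionally store a copy on another node.
        if self.two_level:
            self.nodes[self.buddy(name)].store[name] = state

    def fail(self, name, kind):
        if kind is FaultKind.PERMANENT:
            node = self.nodes[name]
            node.alive = False
            node.store.clear()
        # A transient fault changes nothing in this model: local storage survives.

    def recover(self, name):
        # Return the checkpointed state of `name` if any surviving node holds it.
        for node in self.nodes.values():
            if node.alive and name in node.store:
                return node.store[name]
        return None  # unrecoverable: every copy was lost
```

For example, with `Cluster(["A", "B", "C"], two_level=True)`, a permanent fault on `A` can still be recovered from the copy on `B`, but a subsequent permanent fault on `B` makes `A`'s state unrecoverable, which is why Two-level storage handles only *some* permanent faults. With `two_level=False`, `A` survives a transient fault but not a permanent one.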