A comparative analysis of the reliability of simple and two-level checkpointing techniques in two different distributed industrial control system architectures

Authors:
Alicia Rubio;Rafael Ors
Affiliations:
Fault Tolerant Computing Group (GSTF), Departamento de Informática de Sistemas y Computadores (DISCA), Politechnical University of Valencia (UPV), Camino de Vera s/n., 46022 Valencia, Spain;Fault Tolerant Computing Group (GSTF), Departamento de Informática de Sistemas y Computadores (DISCA), Politechnical University of Valencia (UPV), Camino de Vera s/n., 46022 Valencia, Spain
Venue:
Systems Analysis Modelling Simulation
Year:
2003

Abstract

In distributed industrial control systems, a certain level of reliability must be guaranteed. Checkpointing and rollback techniques offer an attractive way to achieve fault tolerance without an appreciable increase in cost or complexity. Several checkpointing techniques have been proposed, and most of them assume that stable storage is available in the system. Distributed industrial control systems, however, usually lack this kind of storage, so another storage strategy has to be employed. If checkpoints are stored only locally (Simple Checkpointing), the system tolerates only transient faults. If checkpoints are stored locally, at the same node, and additionally at one or more other nodes of the system (Two-level Checkpointing), the system can also recover from some permanent faults. This article presents the results of a reliability study of these two checkpoint storage strategies, carried out to evaluate whether the reliability gain of the Two-level method justifies its greater complexity. For the study, two distributed industrial control systems are presented. Each is based on a different node architecture, which has an important effect on the results of the study.
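The difference between the two storage strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the function names, and the idea of modelling a node's state as a single integer are all assumptions made for illustration, and the abstract does not say how a saved state is reinstated after a permanent fault.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical control node without stable storage."""
    name: str
    state: int = 0
    local_ckpt: Optional[int] = None  # first level: the node's own checkpoint
    held_ckpts: Dict[str, int] = field(default_factory=dict)  # second level: copies kept for other nodes


def take_checkpoint(node: Node, backups: Optional[List[Node]] = None) -> None:
    """Save the node's state locally; also save it on each backup node if any are given.

    With no backups this is Simple Checkpointing; with backups it is
    Two-level Checkpointing (illustrative model only).
    """
    node.local_ckpt = node.state
    for backup in backups or []:
        backup.held_ckpts[node.name] = node.state


def rollback_local(node: Node) -> None:
    """Recover from a transient fault by restoring the local checkpoint."""
    if node.local_ckpt is None:
        raise RuntimeError(f"node {node.name} has no local checkpoint")
    node.state = node.local_ckpt


def restore_from_backup(failed_name: str, survivors: List[Node]) -> Optional[int]:
    """Look for a remote copy of a permanently failed node's checkpoint.

    Returns the saved state, or None if no surviving node holds a copy,
    which is always the case under Simple Checkpointing.
    """
    for node in survivors:
        if failed_name in node.held_ckpts:
            return node.held_ckpts[failed_name]
    return None
```

In this model a node that checkpoints without backups can only roll back while its own memory survives, whereas a node that also checkpoints to a neighbour leaves a copy that outlives its own permanent failure, provided the neighbour itself is still working.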