Rollback and Recovery Strategies for Computer Programs

Authors:
K. M. Chandy;C. V. Ramamoorthy
Affiliations:
Department of Computer Sciences, University of Texas, Austin, Tex. 78712.;Departments of Electrical Engineering and Computer Sciences, University of Texas, Austin, Tex. 78712.
Venue:
IEEE Transactions on Computers
Year:
1972

Abstract

Reliability is an important aspect of any system. On-line diagnosis, parity check coding, triple modular redundancy, and other methods have been used to improve the reliability of computing systems. This paper explores another aspect of reliable computing systems: recovering error-free information when an error is detected at some stage in the processing of a program. If an error or fault is detected while a program is being processed and cannot be corrected immediately, it may be necessary to rerun the entire program. The time spent rerunning the program may be substantial and, in some real-time applications, critical. Recovery time can be reduced by saving states of the program (all the information stored in registers, primary and secondary storage, etc.) at intervals as processing continues. If an error is detected, the program is restarted from its most recently saved state. However, saving a state has a price: the time spent storing all the relevant information in secondary storage. Hence it is expensive to save the state of the program too often, while saving no state at all may cause an unacceptably large recovery time. The problem we solve is the following: determine the optimum points at which the state of the program should be stored so that the program can recover after any malfunction.
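
The trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm or program model, which the abstract does not specify. It assumes a hypothetical toy model: a program is a straight-line sequence of segments with known run times, saving the state after a segment has a known cost, and the goal is to minimize total save cost while keeping the rerun time from the most recent saved state within a bound. The function name `place_checkpoints` and its parameters are illustrative only.

```python
from itertools import accumulate


def place_checkpoints(run_times, save_costs, max_rerun):
    """Choose save points for a straight-line program (toy model, an assumption).

    run_times[i]  -- time to execute segment i
    save_costs[i] -- cost of saving the program state right after segment i
    max_rerun     -- largest tolerated rerun time from the latest saved state

    Returns (total_save_cost, segment indices after which state is saved),
    or None if no placement can meet the rerun bound.
    """
    n = len(run_times)
    # prefix[k] is the time needed to run the first k segments.
    prefix = [0] + list(accumulate(run_times))
    inf = float("inf")
    # best[k]: cheapest total save cost with a restart point at boundary k.
    # Boundary 0 is the program start; boundary n is the program end.
    # Neither requires a save, so neither adds cost.
    best = [inf] * (n + 1)
    prev = [None] * (n + 1)
    best[0] = 0
    for k in range(1, n + 1):
        cost_here = save_costs[k - 1] if k < n else 0
        for j in range(k):
            # Restart point j is usable only if rerunning j..k fits the bound.
            if best[j] < inf and prefix[k] - prefix[j] <= max_rerun:
                cost = best[j] + cost_here
                if cost < best[k]:
                    best[k] = cost
                    prev[k] = j
    if best[n] == inf:
        return None
    # Walk back from the end to recover the chosen save points.
    points = []
    k = prev[n]
    while k > 0:
        points.append(k - 1)
        k = prev[k]
    return best[n], points[::-1]


result = place_checkpoints([4, 3, 5, 2], [1, 5, 2, 1], max_rerun=8)
print(result)
```

In this example, saving after segments 0 and 2 costs 3 in total and leaves rerun stretches of 4, 8, and 2 time units. Saving only after segment 1 also meets the bound but costs 5, which shows why save points should follow both the save cost and the recovery time rather than a fixed interval.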