Reliability is an important aspect of any system. On-line diagnosis, parity check coding, triple modular redundancy, and other methods have been used to improve the reliability of computing systems. In this paper another aspect of reliable computing systems is explored: the problem of recovering error-free information when an error is detected at some stage in the processing of a program. If an error or fault is detected while a program is being processed and cannot be corrected immediately, it may be necessary to run the entire program again. The time spent rerunning the program may be substantial and, in some real-time applications, critical. Recovery time can be reduced by saving states of the program (all the information stored in registers, primary and secondary storage, etc.) at intervals as the processing continues. If an error is detected, the program is restarted from its most recently saved state. However, saving a state has a price: the time spent storing all the relevant information in secondary storage. Hence it is expensive to save the state of the program too often, while not saving any state may cause an unacceptably large recovery time. The problem that we solve is the following: determine the optimum points at which the state of the program should be stored in order to recover after any malfunction.
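
The trade-off between saving state too often and too rarely can be illustrated with a small numerical sketch. This is not the paper's algorithm; it is a toy model built on assumptions that the abstract does not state: failures arrive as a Poisson process with a constant rate, errors are detected immediately, restarting costs nothing beyond the lost work, and save points are equally spaced with one save after every segment. All names (`expected_runtime`, `best_segment_count`, `save_cost`, `failure_rate`) are hypothetical.

```python
import math


def expected_runtime(total_work, n_segments, save_cost, failure_rate):
    """Expected completion time with n equal segments, each followed by a save.

    Assumed model (not from the paper): exponential failures with rate
    `failure_rate`, immediate detection, and a failure restarts only the
    current segment, including its save.
    """
    segment = total_work / n_segments + save_cost
    # Expected time to finish an interval of length `segment` when every
    # failure restarts it from the beginning: (exp(rate * segment) - 1) / rate.
    per_segment = math.expm1(failure_rate * segment) / failure_rate
    return n_segments * per_segment


def best_segment_count(total_work, save_cost, failure_rate, max_segments=1000):
    """Brute-force search for the segment count with the lowest expected time."""
    return min(
        range(1, max_segments + 1),
        key=lambda n: expected_runtime(total_work, n, save_cost, failure_rate),
    )


if __name__ == "__main__":
    work, cost, rate = 100.0, 1.0, 0.01  # illustrative values only
    n = best_segment_count(work, cost, rate)
    print("segments:", n)
    print("expected time:", expected_runtime(work, n, cost, rate))
    print("no intermediate saves:", expected_runtime(work, 1, cost, rate))
```

Under this model, very few saves make each failure expensive, while very many saves make the saving overhead dominate, so the expected time is minimized at an intermediate segment count. Programs whose save cost varies from point to point would need a placement method that does not assume equal spacing.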