A common assumption in existing rollback techniques is that transients, the cause of most failures, subside very quickly, so that a single retry of the program from the previous rollback point is sufficient. We discuss a general rollback strategy with n (n ≥ 2) retries that takes into account multiple transient failures as well as transients of long duration. Ways of deriving practical values of n for a given program are also discussed. Furthermore, we propose using a watchdog processor as an error detection tool to initiate recovery through rollback, since the watchdog processor offers low error latency. We also discuss merging the watchdog processor with the rollback recovery technique to enhance overall system reliability.
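The n-retry idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the names `run_with_rollback`, `segment`, `check`, and `RecoveryFailed` are hypothetical, the rollback point is modeled as a deep copy of the program state, and `check` is only a software stand-in for the error signal a watchdog processor would raise.

```python
import copy


class RecoveryFailed(Exception):
    """Raised when the initial run and all n retries report an error."""


def run_with_rollback(segment, state, check, n=2):
    """Run one program segment with up to n retries from the rollback point.

    segment(state) -> new state (hypothetical unit of work)
    check(state)   -> True if no error is detected (stand-in for a watchdog)
    Returns (new_state, retries_used).
    """
    if n < 2:
        raise ValueError("the strategy discussed here assumes n >= 2 retries")
    checkpoint = copy.deepcopy(state)  # rollback point saved before execution
    for attempt in range(n + 1):  # one initial execution plus n retries
        # Each attempt restarts from a fresh copy of the saved state.
        candidate = segment(copy.deepcopy(checkpoint))
        if check(candidate):
            return candidate, attempt
    raise RecoveryFailed(f"error persisted through {n} retries")


def make_faulty_segment(bad_runs):
    """Build a segment whose first bad_runs executions are corrupted,
    modeling a transient that outlasts a single retry (assumed fault model)."""
    calls = {"count": 0}

    def segment(state):
        calls["count"] += 1
        state["total"] += 1
        if calls["count"] <= bad_runs:
            state["total"] = -1  # corrupted result while the transient lasts
        return state

    return segment


def no_error(state):
    """Toy detector: a negative total marks a corrupted result."""
    return state["total"] >= 0


if __name__ == "__main__":
    result, retries = run_with_rollback(make_faulty_segment(2), {"total": 0}, no_error, n=3)
    print(result, retries)
```

In this toy fault model the transient corrupts the first two executions, so a single retry would fail while n = 3 recovers on the second retry. Choosing n for a real program depends on the transient duration and the cost of re-executing the segment, which the sketch does not model.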