Reliable computer systems (2nd ed.): design and evaluation
Reliable computer systems (2nd ed.): design and evaluation
DEPEND: A Simulation-Based Environment for System Level Dependability Analysis
IEEE Transactions on Computers
Transient Fault Tolerance in Digital Systems
IEEE Micro
Error Recovery in Shared Memory Multiprocessors Using Private Caches
IEEE Transactions on Parallel and Distributed Systems
Analyzing heap error behavior in embedded JVM environments
Proceedings of the 2nd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Increasing Register File Immunity to Transient Errors
Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Compiler-guided register reliability improvement against soft errors
Proceedings of the 5th ACM international conference on Embedded software
Object duplication for improving reliability
ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Runtime integrity checking for inter-object connections
ICCAD '05 Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design
Compiler-Directed Variable Latency Aware SPM Management to CopeWith Timing Problems
Proceedings of the International Symposium on Code Generation and Optimization
Hi-index | 14.98 |
Cache data errors read by a processor may cause CPU control flow error and force the system to enter a CPU-cache reintegration process in redundant processor systems. The reintegration process degrades the system performance and reliability. To reduce the occurrences of such an event, we propose a real-time error recovery scheme that provides effective fault-containment for data errors in cache memories. The scheme is based on cache data broadcasting of a dirty line after modification. It effectively exploits the redundancy of a fault-tolerant system using hardware voting. The scheme recovers from erroneous cache data written by a processor with full coverage. This error recovery feature remedies the insufficiency of error-correcting codes that are unable to prevent such an error. In addition, more than 60 percent of cache lines are fully covered for recovery due to errors originated from the cache itself, including unrecoverable ECC errors. The protocol can also be used to speedup the CPU-cache reintegration process for a temporarily failed processor. The performance overhead of the protocol is to broadcast only 2-3 percent of the total memory references.