Fault-Containment in Cache Memories for TMR Redundant Processor Systems

Authors:
Chung-Ho Chen; Arun K. Somani
Affiliations:
National Yunlin Univ. of Science and Technology, Touliu, Taiwan; Iowa State Univ., Ames
Venue:
IEEE Transactions on Computers
Year:
1999

Abstract

In redundant processor systems, cache data errors read by a processor may cause CPU control-flow errors and force the system into a CPU-cache reintegration process, which degrades both performance and reliability. To reduce the occurrence of such events, we propose a real-time error recovery scheme that provides effective fault containment for data errors in cache memories. The scheme broadcasts a dirty cache line after it is modified and uses hardware voting to exploit the redundancy already present in a fault-tolerant system. It recovers with full coverage from erroneous cache data written by a processor, a class of error that error-correcting codes cannot prevent. In addition, more than 60 percent of cache lines are fully covered for recovery from errors originating in the cache itself, including unrecoverable ECC errors. The protocol can also be used to speed up the CPU-cache reintegration process for a temporarily failed processor. Its performance overhead is small: only 2-3 percent of total memory references are broadcast.
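
The abstract does not describe how the voter is built, so the following is only an illustrative sketch of the general idea behind voting on a broadcast dirty line in a TMR system. It assumes three replicas each hold a copy of the same line as a list of integer words and that the voter takes a bitwise 2-of-3 majority. The names `majority_vote` and `vote_line` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each result bit is the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def vote_line(copies: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Vote word by word across three copies of one cache line.

    Returns the voted line and the indices of the copies that differ from it.
    Assumption (not from the paper): a line is a fixed-length list of words.
    """
    if len(copies) != 3:
        raise ValueError("TMR voting needs exactly three copies")
    if len({len(c) for c in copies}) != 1:
        raise ValueError("all copies must have the same number of words")
    voted = [majority_vote(x, y, z) for x, y, z in zip(*copies)]
    faulty = [i for i, c in enumerate(copies) if list(c) != voted]
    return voted, faulty


# Hypothetical example: replica 1 holds a single-bit error in word 2.
good_line = [0xDEADBEEF, 0x00000001, 0x12345678, 0xFFFFFFFF]
bad_line = list(good_line)
bad_line[2] ^= 0x00000400  # flip one bit to model a transient fault

voted_line, faulty_replicas = vote_line([good_line, bad_line, good_line])
```

In this sketch the voted line equals the fault-free copy and the differing replica is identified by its index, so its cache line could be overwritten with the voted data. Whether the actual scheme repairs the line this way is not stated in the abstract.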