Modeling and improving data cache reliability

Authors:
Ismail Kadayif;Mahmut Kandemir
Affiliations:
Canakkale Onsekiz Mart University;The Pennsylvania State University
Venue:
Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Year:
2007

Abstract

Soft errors arising from energetic particle strikes pose a significant reliability concern for computing systems, especially those running in noisy environments. Technology scaling and aggressive leakage control mechanisms make the problem caused by these transient errors even more severe. It is therefore important to employ reliability-enhancing mechanisms in processor and memory designs to protect them against soft errors. Doing so requires first modeling soft errors and then using the model to study the cost/reliability tradeoffs among reliability-enhancing techniques, so that system requirements can be met. Because cache memories occupy the largest fraction of on-chip real estate today, and their share is expected to keep growing in future designs, they are more vulnerable to soft errors than many other components of a computing system. In this paper, we first present a soft error model for L1 data caches and then explore different reliability-enhancing mechanisms. More specifically, we define a metric called AVFC (Architectural Vulnerability Factor for Caches), which represents the probability that a fault in the cache is visible in the final output of the program. Based on this model, we then propose three architectural schemes for improving reliability in the presence of soft errors. Our first scheme prevents an error from propagating to the lower levels of the memory hierarchy by not forwarding the unmodified data words of a dirty cache block to the L2 cache when the dirty block is replaced. The second scheme selectively invalidates cache blocks to shorten their vulnerable periods, reducing their chances of being hit by a soft error. Measured with the AVFC metric, our experimental results show that these two schemes are very effective in alleviating soft errors in the L1 data cache. Specifically, our first scheme can improve the AVFC metric by 32% without any performance loss.
The second scheme, on the other hand, improves the AVFC metric by between 60% and 97%, at the cost of a performance degradation that varies from 0% to 21.3%, depending on how aggressively cache blocks are invalidated. To reduce the performance overhead caused by cache block invalidation, we also propose a third scheme, which tries to bring a fresh copy of the invalidated block into the cache via prefetching. Our experimental results indicate that this scheme can reduce the performance overhead to less than 1% for all applications in our experimental suite, at the cost of giving up a tolerable portion of the reliability enhancement that the second scheme achieves.
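
The first two schemes can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how modified words are tracked or how blocks are chosen for invalidation. The per-word dirty bits, the idle-time threshold, and every name below (CacheBlock, evict, should_invalidate, WORDS_PER_BLOCK) are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List

WORDS_PER_BLOCK = 4  # hypothetical block size, in words


@dataclass
class CacheBlock:
    """One L1 block with a per-word dirty bit (an assumed bookkeeping choice)."""
    base_addr: int
    words: List[int]
    dirty: List[bool] = field(default_factory=lambda: [False] * WORDS_PER_BLOCK)
    last_access: int = 0

    def write(self, offset: int, value: int, now: int) -> None:
        """Store a word, mark only that word as modified, record the access time."""
        self.words[offset] = value
        self.dirty[offset] = True
        self.last_access = now


def evict(block: CacheBlock, l2: Dict[int, int]) -> int:
    """Scheme 1 sketch: forward only the modified words to L2 on replacement.

    Unmodified words are not written back, so a bit flip they picked up
    while resident in L1 cannot overwrite the clean copy held in L2.
    Returns the number of words forwarded.
    """
    forwarded = 0
    for offset, is_dirty in enumerate(block.dirty):
        if is_dirty:
            l2[block.base_addr + offset] = block.words[offset]
            forwarded += 1
    return forwarded


def should_invalidate(block: CacheBlock, now: int, idle_threshold: int) -> bool:
    """Scheme 2 sketch: flag a block that has been idle longer than a threshold.

    The threshold policy is an assumption; a smaller value invalidates more
    aggressively, shortening the vulnerable period at the price of extra misses.
    A dirty block would still need its modified words written back first.
    """
    return now - block.last_access > idle_threshold


# Usage: the L2 cache holds clean copies of four words.
l2_cache = {100: 1, 101: 2, 102: 3, 103: 4}
blk = CacheBlock(base_addr=100, words=[1, 2, 3, 4])
blk.write(offset=1, value=20, now=5)  # the program modifies one word
blk.words[3] ^= 0x8                   # simulated soft error in an unmodified word
words_forwarded = evict(blk, l2_cache)
```

In this toy run only the modified word reaches L2, and the corrupted but unmodified word is dropped on replacement, which is the effect the abstract attributes to the first scheme. The third scheme (prefetching a fresh copy of an invalidated block) is not modeled here.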