Modeling the impact of permanent faults in caches

Authors:
Daniel Sánchez;Yiannakis Sazeides;Juan M. Cebrián;José M. García;Juan L. Aragón
Affiliations:
University of Murcia, Murcia, Spain;University of Cyprus, Nicosia, Cyprus;University of Murcia, Murcia, Spain;University of Murcia, Murcia, Spain;University of Murcia, Murcia, Spain
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2013



Abstract

The performance and cost benefits that technology scaling has delivered for decades are now challenged by several critical constraints, including reliability. Increases in static and dynamic variations are raising the probability of parametric and wear-out failures and are elevating reliability to a prime design constraint. In particular, the SRAM cells used to build caches, which dominate processor area, are usually minimum sized and thus more prone to failure. It is therefore of paramount importance to develop effective methodologies that facilitate the exploration of reliability techniques for caches. To this end, we present an analytical model that, for a given cache configuration, address trace, and random probability of permanent cell failure, determines the exact expected miss rate and its standard deviation when blocks with faulty bits are disabled. What distinguishes our model is that it is fully analytical and avoids the use of fault maps, yet it is both exact and simpler than previous approaches. The model is used to produce expected miss-rate trends for future technology nodes for both uncorrelated and clustered faults. Key findings based on the proposed model are that (i) block disabling has a negligible impact on the expected miss rate unless the probability of failure is equal to or greater than 2.6e-4, (ii) the fault-map methodology can accurately calculate the expected miss rate as long as 1,000 to 10,000 fault maps are used, and (iii) the expected miss rate for parallel applications increases with the number of threads and, for a given probability of failure, the increase is more pronounced than for sequential execution.
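To give a feel for the quantities involved in block disabling, the sketch below computes the probability that a block holds at least one faulty cell, and the resulting distribution of disabled ways in one set. It is not the paper's model, which also takes the address trace into account. The 64-byte block size, the 8-way associativity, and the function names are assumptions chosen for illustration, and cell failures are assumed independent (the uncorrelated case).

```python
from math import comb


def block_fault_probability(p_cell: float, bits_per_block: int) -> float:
    """Probability that a block has at least one faulty cell,
    assuming each cell fails independently with probability p_cell."""
    return 1.0 - (1.0 - p_cell) ** bits_per_block


def faulty_ways_distribution(p_block: float, ways: int) -> list:
    """Binomial distribution of the number of disabled ways in one set:
    entry k is the probability that exactly k of the ways are faulty."""
    return [
        comb(ways, k) * p_block**k * (1.0 - p_block) ** (ways - k)
        for k in range(ways + 1)
    ]


# Threshold probability of cell failure quoted in the abstract.
p_cell = 2.6e-4
# Assumed geometry: 64-byte data blocks (tag and status bits ignored), 8 ways.
bits_per_block = 64 * 8
ways = 8

p_block = block_fault_probability(p_cell, bits_per_block)
dist = faulty_ways_distribution(p_block, ways)

print(f"P(block disabled)        = {p_block:.4f}")
print(f"P(set has no faulty way) = {dist[0]:.4f}")
print(f"P(set fully disabled)    = {dist[ways]:.3e}")
```

Under these assumptions roughly one block in eight is disabled at a cell failure probability of 2.6e-4. How much that loss of capacity raises the miss rate depends on the workload, which is the part an analytical model driven by an address trace is meant to capture.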