Soft error benchmarking of L2 caches with PARMA

Authors:
Jinho Suh;Mehrtash Manoochehri;Murali Annavaram;Michel Dubois
Affiliations:
University of Southern California, Los Angeles, CA, USA;University of Southern California, Los Angeles, CA, USA;University of Southern California, Los Angeles, CA, USA;University of Southern California, Los Angeles, CA, USA
Venue:
ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Year:
2011

Citing 16
Cited 1

Drowsy caches: simple techniques for reducing leakage power

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Using SimPoint for accurate and efficient simulation

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling

Proceedings of the 30th annual international symposium on Computer architecture
A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Cache Scrubbing in Microprocessors: Myth or Necessity?

PRDC '04 Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC'04)
Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor

Proceedings of the 31st annual international symposium on Computer architecture
An Accurate Analysis of the Effects of Soft Errors in the Instruction and Data Caches of a Pipelined Microprocessor

DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Computing Architectural Vulnerability Factors for Address-Based Structures

Proceedings of the 32nd annual international symposium on Computer Architecture
SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors

DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
Vulnerability analysis of L2 cache elements to single event upsets

Proceedings of the conference on Design, automation and test in Europe: Proceedings
Reducing Data Cache Susceptibility to Soft Errors

IEEE Transactions on Dependable and Secure Computing
Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Reducing cache power with low-cost, multi-bit error-correcting codes

Proceedings of the 37th annual international symposium on Computer architecture
CPPC: correctable parity protected cache

Proceedings of the 38th annual international symposium on Computer architecture
Cross-layer resilience using wearout aware design flow

DSN '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks

Reliability challenges for electric vehicles: from devices to architecture and systems software

Proceedings of the 50th Annual Design Automation Conference

Quantified Score

Hi-index	0.00

Visualization

Abstract

The amount of charge stored in an SRAM cell shrinks rapidly with each technology generation thus increasingly exposing caches to soft errors. Benchmarking the FIT rate of caches due to soft errors is critical to evaluate the relative merits of a plethora of protection schemes that are being proposed to protect against soft errors. The benchmarking of cache reliability introduces a unique challenge as compared to internal processor storage structures, such as the load/store queue. In the case of internal processor structures the time a data bit resides in the structure is so short that it is generally safe to assume that no more than one soft error strike can occur. Thus the reliability of such structures is overwhelmingly dominated by single bit errors. By contrast, a memory block may reside for millions of cycles in a last level cache. In this case it is important to consider the impact of the spatial and temporal distribution of multiple errors within the lifetime of a cache block in the presence of error protection. This paper introduces a unified reliability benchmarking framework called PARMA (Precise Analytical Reliability Model for Architecture). PARMA is a rigorous analytical framework that accurately accounts for the distribution of multiple errors to measure the failure rate under any protection scheme. In a single simulation run PARMA provides a precise FIT rate (expected number of failures in one billion hours) measurement for storage structures where the effect of multiple errors cannot be neglected. We have implemented the PARMA framework on top of a cycle-accurate out-of-order processor simulator (sim-outorder) to benchmark L2 cache failure rates for a set of CPU 2000 benchmarks. The effectiveness of three protection schemes are compared in terms of L2 cache FIT rate: parity, word-level Single Error Correcting Double Error Detecting (SECDED) code and block-level SECDED. Exploiting the accuracy of PARMA, we demonstrate that current techniques to evaluate cache FIT rates in the presence of SECDED, such as accelerated fault injection simulations and first-principle derivations based on Architectural Vulnerability Factor (AVF), can overestimate FIT rates by vast amounts. Based on the insights gained during this research we also introduce a new approximate analytical model that can quickly and more accurately estimate cache FIT rate in the presence of SECDED.