Soft error benchmarking of L2 caches with PARMA

  • Authors:
  • Jinho Suh;Mehrtash Manoochehri;Murali Annavaram;Michel Dubois

  • Affiliations:
  • University of Southern California, Los Angeles, CA, USA;University of Southern California, Los Angeles, CA, USA;University of Southern California, Los Angeles, CA, USA;University of Southern California, Los Angeles, CA, USA

  • Venue:
  • ACM SIGMETRICS Performance Evaluation Review
  • Year:
  • 2011

Abstract

The amount of charge stored in an SRAM cell shrinks rapidly with each technology generation, increasingly exposing caches to soft errors. Benchmarking the FIT rate of caches due to soft errors is critical for evaluating the relative merits of the many protection schemes being proposed. Benchmarking cache reliability poses a unique challenge compared with internal processor storage structures such as the load/store queue. In internal processor structures, a data bit resides so briefly that it is generally safe to assume that no more than one soft error strike can occur, so the reliability of such structures is overwhelmingly dominated by single-bit errors. By contrast, a memory block may reside for millions of cycles in a last-level cache. In this case it is important to consider the impact of the spatial and temporal distribution of multiple errors within the lifetime of a cache block in the presence of error protection. This paper introduces a unified reliability benchmarking framework called PARMA (Precise Analytical Reliability Model for Architecture). PARMA is a rigorous analytical framework that accurately accounts for the distribution of multiple errors to measure the failure rate under any protection scheme. In a single simulation run, PARMA provides a precise FIT rate (expected number of failures in one billion hours) measurement for storage structures where the effect of multiple errors cannot be neglected. We have implemented the PARMA framework on top of a cycle-accurate out-of-order processor simulator (sim-outorder) to benchmark L2 cache failure rates for a set of SPEC CPU2000 benchmarks. The effectiveness of three protection schemes is compared in terms of L2 cache FIT rate: parity, word-level Single Error Correcting Double Error Detecting (SECDED) code, and block-level SECDED.
Exploiting the accuracy of PARMA, we demonstrate that current techniques for evaluating cache FIT rates in the presence of SECDED, such as accelerated fault injection simulations and first-principles derivations based on the Architectural Vulnerability Factor (AVF), can overestimate FIT rates by vast amounts. Based on the insights gained during this research, we also introduce a new approximate analytical model that can quickly and more accurately estimate the cache FIT rate in the presence of SECDED.
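
To make the role of multiple errors concrete, the following is a minimal sketch of why residency time and protection granularity matter under SECDED. It is not the PARMA model. It assumes that bit flips arrive as a Poisson process at a constant raw rate per bit, that a SECDED domain fails only when it accumulates two or more flips during one uninterrupted residency, and that check bits, access patterns, and spatial multi-bit upsets can be ignored. The function names, the 64-bit word and 512-bit block sizes, and the raw rate of 0.001 FIT per bit are all hypothetical values chosen for illustration.

```python
import math


def poisson_tail(mu: float, k: int, terms: int = 60) -> float:
    """P(N >= k) for N ~ Poisson(mu).

    Summed term by term rather than as 1 - P(N < k), which would lose
    all precision for the tiny mu values typical of soft errors.
    Assumes mu is small compared with `terms`.
    """
    if k <= 0:
        return 1.0
    if mu <= 0.0:
        return 0.0
    term = math.exp(-mu) * mu**k / math.factorial(k)
    total = 0.0
    for j in range(k, k + terms):
        total += term
        term *= mu / (j + 1)
    return min(total, 1.0)


def domain_failure_prob(fit_per_bit: float, bits: int, hours: float,
                        correctable: int) -> float:
    """Probability that one protection domain sees more flips than it can correct.

    fit_per_bit is the assumed raw rate in failures per 1e9 hours per bit.
    """
    mu = fit_per_bit * 1e-9 * bits * hours  # expected flips during residency
    return poisson_tail(mu, correctable + 1)


def block_failure_word_secded(fit_per_bit: float, hours: float,
                              words: int = 8, word_bits: int = 64) -> float:
    """Block fails if any one of its independently protected words fails."""
    p_word = domain_failure_prob(fit_per_bit, word_bits, hours, correctable=1)
    # 1 - (1 - p_word)**words, written to stay accurate for tiny p_word.
    return -math.expm1(words * math.log1p(-p_word))


def block_failure_block_secded(fit_per_bit: float, hours: float,
                               block_bits: int = 512) -> float:
    """Block fails if the whole block accumulates two or more flips."""
    return domain_failure_prob(fit_per_bit, block_bits, hours, correctable=1)


if __name__ == "__main__":
    raw_fit = 1e-3   # hypothetical raw FIT per bit
    residency = 1.0  # hypothetical residency in hours
    print("word-level SECDED :", block_failure_word_secded(raw_fit, residency))
    print("block-level SECDED:", block_failure_block_secded(raw_fit, residency))
```

Under these assumptions the double-error probability grows roughly with the square of the expected number of flips in a domain, so it depends strongly on both how long a block stays resident and how many bits share one code word. A benchmark that really measures this would also have to track per-block lifetimes and accesses during simulation, which this sketch does not attempt.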