Reliability-Driven ECC Allocation for Multiple Bit Error Resilience in Processor Cache

  • Authors:
  • Somnath Paul;Fang Cai;Xinmiao Zhang;Swarup Bhunia

  • Affiliations:
  • Case Western Reserve University, Cleveland;Case Western Reserve University, Cleveland;Case Western Reserve University, Cleveland;Case Western Reserve University, Cleveland

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 2011

Abstract

With increasing parameter variations in nanometer technologies, on-chip processor caches are becoming highly vulnerable to runtime failures induced by soft errors, voltage or thermal noise, and aging effects. Nondeterministic and unreliable memory operation due to these runtime failures can be addressed by: 1) designing the memory for worst-case scenarios and/or 2) runtime error detection and correction. Worst-case guard-banding can lead to overly pessimistic results for cell footprint and power. Conventional error correcting code (ECC) used in processor caches, on the other hand, has very limited correction capability, making it insufficient to protect memory in scaled technologies (sub-45 nm), which are vulnerable to multiple-bit failures within a 64-bit word. The need to tolerate multibit failures grows with supply voltage scaling for low-power operation. We note that, due to inter- and intra-die parameter variations, different memory blocks move to different reliability corners. Uniform ECC protection for all memory blocks fails to account for the distribution of vulnerability across memory blocks, and sizing that uniform protection for the worst-case vulnerability of a memory block leads to overly pessimistic results. In this paper, we propose a reliability-driven ECC allocation scheme that matches the relative vulnerability of a memory block (determined using postfabrication characterization) with appropriate ECC protection. We achieve postfabrication variable ECC allocation by storing the check bits in the “ways” of an associative cache. We use a shortened Bose-Chaudhuri-Hocquenghem (BCH) cyclic code with zero padding, which provides high random error correction capability with a modest number of check bits. Moreover, we propose efficient circuit/architecture-level optimizations of the ECC encoding/decoding logic to minimize the impact on area, performance, and energy. Simulation results for SPEC2000 benchmarks show that such a variable ECC scheme tolerates high failure rates with negligible performance (four percent) and area (0.2 percent) penalties.
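
The idea of matching ECC strength to block vulnerability can be illustrated with a minimal Python sketch. It is not the authors' allocation algorithm: the function names, the vulnerability values, and the tier thresholds are all hypothetical, and it does not model how check bits are stored in cache ways. It relies only on the standard bound that a binary BCH code of length 2^m - 1 correcting t errors needs at most m*t check bits, applied to a code shortened to a 64-bit data word.

```python
def bch_check_bits(data_bits: int, t: int) -> int:
    """Upper bound on check bits for a shortened binary BCH code.

    Finds the smallest m such that the parent code of length 2^m - 1 can
    hold data_bits plus m*t check bits, then returns m*t.
    """
    m = 1
    while (2 ** m - 1) < data_bits + m * t:
        m += 1
    return m * t


def allocate_ecc(vulnerability: dict, tiers: list) -> dict:
    """Assign each block a correction capability t.

    tiers is a list of (max_vulnerability, t) pairs sorted by ascending
    threshold (hypothetical values). A block gets the first tier whose
    threshold covers it, or the strongest tier if none does.
    """
    allocation = {}
    for block, v in vulnerability.items():
        chosen = tiers[-1][1]
        for ceiling, t in tiers:
            if v <= ceiling:
                chosen = t
                break
        allocation[block] = chosen
    return allocation


# Hypothetical per-block failure estimates from postfabrication characterization.
blocks = {"block0": 1e-7, "block1": 5e-5, "block2": 1e-2}
# Hypothetical tiers: (maximum vulnerability, errors corrected per 64-bit word).
tiers = [(1e-6, 1), (1e-4, 2), (1.0, 4)]

allocation = allocate_ecc(blocks, tiers)
variable_cost = sum(bch_check_bits(64, t) for t in allocation.values())
uniform_cost = len(blocks) * bch_check_bits(64, max(t for _, t in tiers))
print(allocation, variable_cost, uniform_cost)
```

Under these assumed numbers, the variable allocation spends fewer check bits per word across the three blocks than protecting every block at the strongest tier, which is the overhead argument the abstract makes against worst-case uniform ECC.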