A framework for correction of multi-bit soft errors in L2 caches based on redundancy

  • Authors: Koustav Bhattacharya; Nagarajan Ranganathan; Soontae Kim

  • Affiliations: Department of Computer Science and Engineering, University of South Florida, Tampa, FL; Department of Computer Science and Engineering, University of South Florida, Tampa, FL; School of Engineering at Information and Communication University, Daejeon, Korea

  • Venue: IEEE Transactions on Very Large Scale Integration (VLSI) Systems
  • Year: 2009

Abstract

With the continuous decrease in the minimum feature size and the increase in chip density due to technology scaling, on-chip L2 caches are becoming increasingly susceptible to multi-bit soft errors. The increase in multi-bit errors could lead to a higher risk of data corruption and potentially cause application programs to crash. Traditionally, L2 caches have been protected from soft errors using techniques such as 1) error detection/correction codes; 2) physical interleaving of cache bit lines to convert multi-bit errors into single-bit errors; and 3) cache scrubbing. The first two methods incur large area overheads for multi-bit errors, while choosing an appropriate time interval for scrubbing can be difficult. In this paper, we investigate in detail the multi-bit soft error rates in large L2 caches and propose a framework of solutions for their correction based on the amount of redundancy present in the memory hierarchy. We investigate several new techniques for reducing multi-bit errors in large L2 caches, in which the multi-bit errors are detected using simple error detection codes and corrected using the data redundancy in the memory hierarchy. We also propose several techniques to control/mine the redundancy in the memory hierarchy to further improve the reliability of the L2 cache. The proposed techniques were implemented in the SimpleScalar framework and validated using the SPEC 2000 integer and floating-point benchmarks for L2 cache vulnerability, global cache miss rate, average cycle count, and main memory write-back rate, considering the area and power overheads. Experimental results indicate that the vulnerability of L2 caches can be decreased by 40% on average for integer benchmarks and 32% on average for floating-point benchmarks, with an average multi-bit error coverage of about 96%, with significantly lower area and power overheads, and with virtually no performance penalty. The proposed techniques are applicable to both single- and multi-core processor-based systems.
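
The abstract does not say which detection codes or recovery policies the paper uses, so the following is only a minimal sketch of the general idea of detecting a multi-bit error with a simple code and correcting it from a redundant copy in the memory hierarchy. It assumes interleaved parity as the detection code and a refetch from the next memory level as the correction step; `parity_bits`, `Line`, and `L2Sketch` are hypothetical names introduced for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

WORD_BITS = 32  # assumed word width for this sketch


def parity_bits(word: int, groups: int = 4) -> int:
    """Interleaved parity: bit i of the word feeds parity group i % groups.

    Flips confined to any window of `groups` adjacent bits always change
    at least one parity group, so such multi-bit errors are detected.
    """
    check = 0
    for i in range(WORD_BITS):
        if (word >> i) & 1:
            check ^= 1 << (i % groups)
    return check


@dataclass
class Line:
    data: int
    check: int
    dirty: bool = False  # dirty lines have no up-to-date copy below


class L2Sketch:
    """Toy word-granularity cache backed by a dict acting as main memory."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.lines = {}

    def fill(self, addr: int) -> None:
        # Bring a clean copy in from the next level of the hierarchy.
        word = self.memory[addr]
        self.lines[addr] = Line(word, parity_bits(word))

    def write(self, addr: int, word: int) -> None:
        # A write makes the cached copy the only valid one.
        self.lines[addr] = Line(word, parity_bits(word), dirty=True)

    def read(self, addr: int) -> int:
        line = self.lines[addr]
        if parity_bits(line.data) == line.check:
            return line.data
        if line.dirty:
            # Detected, but no redundant copy exists to correct from.
            raise RuntimeError("uncorrectable error in dirty line")
        # Clean line: correct by refetching the redundant copy.
        self.fill(addr)
        return self.lines[addr].data


if __name__ == "__main__":
    cache = L2Sketch({0x40: 0xDEADBEEF})
    cache.fill(0x40)
    cache.lines[0x40].data ^= 0b111 << 5  # inject a 3-bit adjacent upset
    print(hex(cache.read(0x40)))
```

In this sketch only clean lines can be repaired, which illustrates why the amount of redundancy in the hierarchy matters: a dirty line whose error is detected has no valid copy to recover from.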