A framework for correction of multi-bit soft errors in L2 caches based on redundancy

Authors:
Koustav Bhattacharya;Nagarajan Ranganathan;Soontae Kim
Affiliations:
Department of Computer Science and Engineering, University of South Florida, Tampa, FL;Department of Computer Science and Engineering, University of South Florida, Tampa, FL;School of Engineering at Information and Communication University, Daejeon, Korea
Venue:
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Year:
2009

Abstract

As technology scaling continues to shrink the minimum feature size and increase chip density, on-chip L2 caches are becoming increasingly susceptible to multi-bit soft errors. The increase in multi-bit errors raises the risk of data corruption and can cause application programs to crash. Traditionally, L2 caches have been protected from soft errors using techniques such as: 1) error detection/correction codes; 2) physical interleaving of cache bit lines to convert multi-bit errors into single-bit errors; and 3) cache scrubbing. The first two methods incur large area overheads for multi-bit errors, while choosing a suitable scrubbing interval is difficult. In this paper, we investigate in detail the multi-bit soft error rates in large L2 caches and propose a framework of solutions for their correction based on the amount of redundancy present in the memory hierarchy. We investigate several new techniques for reducing multi-bit errors in large L2 caches, in which the multi-bit errors are detected using simple error detection codes and corrected using the data redundancy in the memory hierarchy. We also propose several techniques to control/mine the redundancy in the memory hierarchy to further improve the reliability of the L2 cache. The proposed techniques were implemented in the SimpleScalar framework and validated using the SPEC 2000 integer and floating point benchmarks for L2 cache vulnerability, global cache miss rate, average cycle count, and main memory write-back rate, considering the area and power overheads. Experimental results indicate that the vulnerability of L2 caches can be decreased by 40% on average for integer benchmarks and 32% on average for floating point benchmarks, with an average multi-bit error coverage of about 96%, significantly lower area and power overheads, and virtually no performance penalty.
The proposed techniques are applicable to both single and multi-core processor-based systems.
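
The detect-then-recover idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the byte-wise XOR parity, the `L2Sketch` class, and all other names below are assumptions chosen for illustration. A clean line has a redundant copy in main memory, so a detected error can be repaired by refetching; a dirty line has no such copy, which is why controlling the amount of redundancy matters.

```python
from dataclasses import dataclass
from typing import Dict


def xor_parity(data: bytes) -> int:
    """8 interleaved parity bits: bit i covers bit i of every byte.

    Hypothetical stand-in for a "simple error detection code"; it flags
    any burst of up to 8 adjacent flipped bits within the line.
    """
    parity = 0
    for byte in data:
        parity ^= byte
    return parity


@dataclass
class Line:
    data: bytes
    parity: int
    dirty: bool = False


class L2Sketch:
    """Toy L2 cache that corrects errors from the copy in main memory."""

    def __init__(self, memory: Dict[int, bytes]):
        self.memory = memory              # next level of the hierarchy
        self.lines: Dict[int, Line] = {}

    def fill(self, addr: int) -> None:
        data = self.memory[addr]
        self.lines[addr] = Line(data, xor_parity(data))

    def write(self, addr: int, data: bytes) -> None:
        # A dirty line is the only up-to-date copy: no redundancy exists.
        self.lines[addr] = Line(data, xor_parity(data), dirty=True)

    def read(self, addr: int) -> bytes:
        line = self.lines[addr]
        if xor_parity(line.data) == line.parity:
            return line.data              # no error detected
        if line.dirty:
            raise RuntimeError("error detected in dirty line: no redundant copy")
        self.fill(addr)                   # clean line: refetch the redundant copy
        return self.lines[addr].data
```

In this sketch, writing dirty lines back earlier would turn them into clean, recoverable lines at the cost of extra memory traffic, which mirrors the trade-off between vulnerability and write-back rate that the abstract mentions.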