Mitigating the effects of large multiple cell upsets (MCUs) in memories

  • Authors:
  • Juan Antonio Maestro; Pedro Reviriego; Sanghyeon Baeg; Shijie Wen; Richard Wong

  • Affiliations:
  • Universidad Antonio de Nebrija, Madrid, Spain; Universidad Antonio de Nebrija, Madrid, Spain; Hanyang University, Kyung-Gi-Do, Korea; Cisco Systems, San Jose, CA; Cisco Systems, San Jose, CA

  • Venue:
  • ACM Transactions on Design Automation of Electronic Systems (TODAES)
  • Year:
  • 2011

Abstract

Reliability is a critical issue for memories. Radiation particles that hit the device can cause errors in some cells, which can lead to data corruption. To avoid this problem, memories are protected with per-word error correction codes (ECCs). Typically, single-error correction and double-error detection (SEC-DED) codes are used. As technology scales, errors caused by radiation particles on memories tend to affect more than one cell, which is known as a multiple cell upset (MCU). To ensure that only a single cell is affected in each word, interleaving is used. With interleaving, cells that belong to the same word are placed at a sufficient distance that an MCU will affect only a single cell in each word. The use of interleaving significantly increases the cost of the device. Also, determining the interleaving distance (ID) required to prevent MCUs from causing double errors is not trivial. Typically, accelerated radiation experiments with a limited number of particle hits are used. They provide a lower bound on the required ID, but larger MCUs may occur with a low probability. Even if the percentage of such large MCUs is very low, the impact on reliability can be significant. This article presents a technique to mitigate the effects on memory reliability of large MCUs, that is, those that exceed the ID. The proposed approach is able to correct most double errors caused by large MCUs by exploiting the locality of the errors within an MCU.
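
The relationship between interleaving distance and MCU size described in the abstract can be illustrated with a minimal sketch. It is not the article's method. It assumes a toy model in which an MCU is a contiguous run of cells in one physical row and words are interleaved column by column; the function names, the value of `ID`, and the MCU positions are hypothetical and chosen only for illustration.

```python
def physical_to_logical(col, interleaving_distance):
    """Map a physical column to (word index, bit index within that word).

    Assumed layout: consecutive physical columns belong to different
    words, so two bits of the same word sit interleaving_distance apart.
    """
    return col % interleaving_distance, col // interleaving_distance


def errors_per_word(mcu_start, mcu_width, interleaving_distance):
    """Count flipped bits per word for an MCU modelled as a contiguous
    run of mcu_width upset cells starting at column mcu_start."""
    counts = {}
    for col in range(mcu_start, mcu_start + mcu_width):
        word, _bit = physical_to_logical(col, interleaving_distance)
        counts[word] = counts.get(word, 0) + 1
    return counts


ID = 4  # hypothetical interleaving distance

# MCU no wider than the ID: every affected word sees a single error,
# which a SEC-DED code can correct.
small = errors_per_word(mcu_start=5, mcu_width=3, interleaving_distance=ID)

# MCU wider than the ID: one word receives two errors, which a SEC-DED
# code can detect but not correct on its own.
large = errors_per_word(mcu_start=5, mcu_width=5, interleaving_distance=ID)

print(small)
print(large)
```

In this toy model, the two upset cells of the doubly hit word (physical columns 5 and 9) map to neighbouring bit positions of that word, which is one way to picture the error locality the abstract refers to. The correction scheme itself is described in the article and is not reproduced here.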