System RAS implications of DRAM soft errors

  • Authors: T. J. Dell

  • Affiliations: IBM Systems and Technology Group, Essex Junction, Vermont

  • Venue: IBM Journal of Research and Development
  • Year: 2008

Abstract

While attention in the realm of computer design has shifted away from the classic DRAM soft-error rate (SER) and focused instead on SRAM and microprocessor latch sensitivities as sources of potential errors, DRAM SER nonetheless remains a challenging problem. This is true even though both cosmic-ray-induced and alpha-particle-induced DRAM soft errors have been well modeled and, to a certain degree, well understood. However, the often-overlooked alignment of a DRAM hard error and a random soft error can have major reliability, availability, and serviceability (RAS) implications for systems that require an extremely long mean time between failures. The net effect is that what appears to be a well-behaved, single-bit soft error ends up overwhelming a seemingly state-of-the-art mitigation technique. This paper describes some of the history of DRAM soft-error discovery and the subsequent development of mitigation strategies. It then examines some architectural considerations that can exacerbate the effect of DRAM soft errors and may have system-level implications for today's standard fault-tolerance schemes.
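
The alignment effect described in the abstract can be illustrated with a toy model. The abstract does not name the mitigation technique, so the sketch below assumes a single-error-correcting, double-error-detecting (SEC-DED) extended Hamming (8,4) code purely for illustration; the function names, the stuck-at fault model, and the chosen bit positions are hypothetical and are not taken from the paper. It shows a stuck-at hard fault that the code silently corrects on its own, a soft error that it also corrects on its own, and the two landing in the same word, which the code can only flag as uncorrectable.

```python
# Toy SEC-DED model (extended Hamming (8,4)); an illustrative assumption,
# not the scheme analyzed in the paper.

def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword.
    Index 0 holds overall parity; indices 1..7 are Hamming positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d          # data bits
    c[1] = c[3] ^ c[5] ^ c[7]           # parity over positions 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]           # parity over positions 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]           # parity over positions 4,5,6,7
    for pos in range(1, 8):             # overall parity over positions 1..7
        c[0] ^= c[pos]
    return c

def decode(word):
    """Return (status, data); data is None when the word is uncorrectable."""
    c = list(word)
    syndrome = 0
    overall = 0
    for pos in range(8):
        if c[pos]:
            syndrome ^= pos             # position 0 contributes nothing
            overall ^= 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1                # single-bit error: flip it back
        status = "corrected"
    else:
        return "uncorrectable", None    # two bits in error: detect only
    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, data

def stuck_at(word, pos, value):
    """Hard fault: the cell at `pos` always reads `value`."""
    w = list(word)
    w[pos] = value
    return w

def flip(word, pos):
    """Soft error: a one-time upset of the cell at `pos`."""
    w = list(word)
    w[pos] ^= 1
    return w

stored = encode(0b1011)
hard_only = stuck_at(stored, 5, 0)      # position 5 holds a 1 for this data
soft_only = flip(stored, 2)
aligned = flip(hard_only, 2)            # soft error hits the same word

print(decode(hard_only))
print(decode(soft_only))
print(decode(aligned))
```

In this model the hard fault is invisible to software as long as it is the only error in the word, which is why the later soft error appears to be an ordinary single-bit event even though the code's correction capability has already been used up.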