System RAS implications of DRAM soft errors

Authors:
T. J. Dell
Affiliations:
IBM Systems and Technology Group, Essex Junction, Vermont
Venue:
IBM Journal of Research and Development
Year:
2008

Abstract

While attention in the realm of computer design has shifted away from the classic DRAM soft-error rate (SER) and focused instead on SRAM and microprocessor latch sensitivities as sources of potential errors, DRAM SER nonetheless remains a challenging problem. This is true even though both cosmic ray-induced and alpha-particle-induced DRAM soft errors have been well modeled and, to a certain degree, well understood. However, the often-overlooked alignment of a DRAM hard error and a random soft error can have major reliability, availability, and serviceability (RAS) implications for systems that require an extremely long mean time between failures. The net of this effect is that what appears to be a well-behaved, single-bit soft error ends up overwhelming a seemingly state-of-the-art mitigation technique. This paper describes some of the history of DRAM soft-error discovery and the subsequent development of mitigation strategies. It then examines some architectural considerations that can exacerbate the effect of DRAM soft errors and may have system-level implications for today's standard fault-tolerance schemes.
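The alignment effect described in the abstract can be illustrated with a toy model. The sketch below is not the paper's method or any production memory code. It assumes a minimal single-error-correct, double-error-detect (SEC-DED) extended Hamming(8,4) word, and the helper names (`encode`, `decode`, `apply_stuck_at`, `flip`) are hypothetical. It shows that a stuck-at hard fault alone is corrected, a soft flip alone is corrected, but the two landing in the same word exceed what the code can correct.

```python
# Toy SEC-DED model (an assumption for illustration, not the paper's scheme).
# Word layout: index 0 = overall parity, indices 1..7 = Hamming(7,4) positions.

def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming word (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d          # data positions
    code[1] = code[3] ^ code[5] ^ code[7]           # parity over 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]           # parity over 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]           # parity over 4,5,6,7
    code[0] = sum(code[1:]) % 2                     # overall (even) parity
    return code

def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected', or 'uncorrectable'."""
    w = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos]:
            syndrome ^= pos                         # XOR of set-bit positions
    overall_odd = sum(w) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        w[syndrome] ^= 1                            # single error: flip it back
        status = "corrected"
    else:
        return "uncorrectable", None                # double error: detect only
    nibble = w[3] | (w[5] << 1) | (w[6] << 2) | (w[7] << 3)
    return status, nibble

def apply_stuck_at(word, pos, value):
    """Model a hard fault: the cell at `pos` always reads `value`."""
    w = list(word)
    w[pos] = value
    return w

def flip(word, pos):
    """Model a soft error: a transient flip of the cell at `pos`."""
    w = list(word)
    w[pos] ^= 1
    return w

stored = encode(0b1010)
hard_only = apply_stuck_at(stored, 5, 0)   # hard fault in one cell
soft_only = flip(stored, 6)                # soft error in another cell
aligned = flip(hard_only, 6)               # both in the same word

print(decode(hard_only))   # hard fault alone is corrected
print(decode(soft_only))   # soft error alone is corrected
print(decode(aligned))     # alignment is detected but not corrected
```

In this toy model each fault on its own looks like a routine single-bit event, which is why the hard error can go unnoticed until a soft error lands in the same word.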