A study of DRAM failures in the field

Authors:
Vilas Sridharan;Dean Liberty
Affiliations:
RAS Architecture AMD, Inc., Boxborough, MA;RAS Architecture AMD, Inc., Boxborough, MA
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Abstract

Most modern computer systems use dynamic random access memory (DRAM) as a main memory store. Recent publications have confirmed that DRAM errors are a common source of failures in the field. Therefore, further attention to the faults experienced by DRAM subsystems is warranted. In this paper, we present a study of 11 months of DRAM errors in a large high-performance computing cluster. Our goal is to understand the failure modes, rates, and fault types experienced by DRAM in production settings. We identify several unique DRAM failure modes, including single-bit, multi-bit, and multi-chip failures. We also provide a deterministic bound on the rate of transient faults in the DRAM array by exploiting the presence of a hardware scrubber on our nodes. We draw several conclusions from our study. First, DRAM failures are dominated by permanent, rather than transient, faults, although not to the extent found by previous publications. Second, DRAMs are susceptible to large multi-bit failures, such as failures that affect an entire DRAM row or column, indicating faults in shared internal circuitry. Third, we identify a DRAM failure mode that disrupts access to other DRAM devices that share the same board-level circuitry. Finally, we find that chipkill error-correcting codes (ECC) are extremely effective, reducing the node failure rate from uncorrected DRAM errors by 42x compared to single-error-correct/double-error-detect (SEC-DED) ECC.
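The abstract does not spell out how the hardware scrubber yields a bound on transient faults, so the following is only a sketch of one plausible reading, not the authors' method. It assumes the scrubber rewrites corrected data to every location at least once per scrub interval, so a transient fault cannot keep producing errors at the same location past the next scrub pass. The record layout, the function name `classify_faults`, and the 24-hour interval are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def classify_faults(errors, scrub_interval_s):
    """Label each DRAM location from its corrected-error timestamps.

    errors: iterable of (timestamp_s, location) pairs (hypothetical layout).
    scrub_interval_s: assumed time for the scrubber to visit every location.

    If errors at one location span more than one scrub interval, a scrub
    pass must have rewritten the location in between, so the fault is
    treated as permanent. Otherwise it may be transient, which makes the
    count of such locations an upper bound on transient faults.
    """
    times_by_location = defaultdict(list)
    for timestamp_s, location in errors:
        times_by_location[location].append(timestamp_s)

    labels = {}
    for location, times in times_by_location.items():
        span_s = max(times) - min(times)
        if span_s > scrub_interval_s:
            labels[location] = "permanent"
        else:
            labels[location] = "possibly transient"
    return labels


if __name__ == "__main__":
    # Hypothetical error log: location "a" errs twice within seconds,
    # location "b" errs again more than a day later.
    log = [(0, "a"), (10, "a"), (0, "b"), (100_000, "b")]
    print(classify_faults(log, scrub_interval_s=86_400))
```

Under these assumptions, only location "b" is attributed to a permanent fault. Location "a" stays in the "possibly transient" bucket, which is why the result is a bound rather than an exact transient-fault count.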