DRAM errors in the wild: a large-scale field study

Authors:
Bianca Schroeder;Eduardo Pinheiro;Wolf-Dietrich Weber
Affiliations:
University of Toronto, Toronto, Canada;Google Inc., Mountain View, CA;Google Inc., Mountain View, CA
Venue:
Communications of the ACM
Year:
2011

Citing 13
Cited 6

Field testing for cosmic ray soft errors in semiconductor memories

IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Increasing relevance of memory hardware errors: a case for recoverable programming models

EW 9 Proceedings of the 9th workshop on ACM SIGOPS European workshop: beyond the PC: new challenges for the operating system
Simulation based analysis of temperature effect on the faulty behavior of embedded DRAMs

Proceedings of the IEEE International Test Conference 2001
Cache Scrubbing in Microprocessors: Myth or Necessity?

PRDC '04 Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC'04)
Susceptibility of Commodity Systems and Software to Memory Soft Errors

IEEE Transactions on Computers
The Soft Error Problem: An Architectural Perspective

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Soft Errors in Advanced Computer Systems

IEEE Design & Test
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Bigtable: a distributed storage system for structured data

OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
A memory soft error measurement on production systems

ATC'07 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference
DRAM errors in the wild: a large-scale field study

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Error-correcting codes for semiconductor memory applications: a state-of-the-art review

IBM Journal of Research and Development

System implications of memory reliability in exascale computing

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Cooperative Application/OS DRAM fault recovery

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Evaluating operating system vulnerability to memory errors

Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
kMemvisor: flexible system wide memory mirroring in virtual environments

Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Using unreliable virtual hardware to inject errors in extreme-scale systems

Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	48.22

Visualization

Abstract

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of dual in-line memory module (DIMM) days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology, and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000--70,000 errors per billion device hours per Mb and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, unlike commonly feared, we do not observe any indication that newer generations of DIMMs have worse error behavior.