DRAM errors in the wild: a large-scale field study

Authors:
Bianca Schroeder; Eduardo Pinheiro; Wolf-Dietrich Weber
Affiliations:
University of Toronto, Toronto, ON, Canada; Google Inc., Mountain View, CA, USA; Google Inc., Mountain View, CA, USA
Venue:
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Year:
2009

Abstract

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. These failures are costly in terms of both hardware replacement and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors rather than soft errors, which previous work suspected to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field when all other factors are taken into account. Finally, contrary to common fears, we do not observe any indication that newer generations of DIMMs have worse error behavior.
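
To give a feel for the reported rate of 25,000 to 70,000 errors per billion device hours per Mbit, the following minimal sketch converts it into expected errors per DIMM per year. The function name, the 1 GiB DIMM capacity, and the use of binary units (1 Mbit = 2^20 bits) are assumptions made for illustration; they do not come from the abstract. The result is a fleet-wide average and says nothing about how errors are distributed across individual DIMMs.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def errors_per_dimm_year(errors_per_1e9_hours_per_mbit: float,
                         capacity_gib: float) -> float:
    """Convert a per-Mbit error rate into expected errors per DIMM per year.

    The input rate is expressed as errors per 10^9 device hours per Mbit.
    capacity_gib is a hypothetical DIMM capacity (assumption, binary units).
    """
    mbit = capacity_gib * 1024 * 8  # GiB -> MiB -> Mbit
    errors_per_hour = errors_per_1e9_hours_per_mbit * mbit / 1e9
    return errors_per_hour * HOURS_PER_YEAR


# Assumed example: a 1 GiB DIMM at the low and high ends of the reported range.
low = errors_per_dimm_year(25_000, 1.0)
high = errors_per_dimm_year(70_000, 1.0)
print(f"{low:.0f} to {high:.0f} expected errors per DIMM per year")
```

Under these assumptions the average works out to roughly 1,800 to 5,000 errors per DIMM per year, which illustrates why the reported rates stand out against earlier estimates.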