Field testing for cosmic ray soft errors in semiconductor memories
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Increasing relevance of memory hardware errors: a case for recoverable programming models
EW 9 Proceedings of the 9th workshop on ACM SIGOPS European workshop: beyond the PC: new challenges for the operating system
An Experimental Study of Security Vulnerabilities Caused by Errors
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Simulation based analysis of temperature effect on the faulty behavior of embedded DRAMs
Proceedings of the IEEE International Test Conference 2001
Using Memory Errors to Attack a Virtual Machine
SP '03 Proceedings of the 2003 IEEE Symposium on Security and Privacy
Cache Scrubbing in Microprocessors: Myth or Necessity?
PRDC '04 Proceedings of the 10th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC'04)
The Soft Error Problem: An Architectural Perspective
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Soft Errors in Advanced Computer Systems
IEEE Design & Test
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Interpreting the data: Parallel analysis with Sawzall
Scientific Programming - Dynamic Grids and Worldwide Computing
Bigtable: a distributed storage system for structured data
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
A memory soft error measurement on production systems
ATC'07 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference
The case for RAMClouds: scalable high-performance storage entirely in DRAM
ACM SIGOPS Operating Systems Review
Virtualized and flexible ECC for main memory
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Characterizing cloud computing hardware reliability
Proceedings of the 1st ACM symposium on Cloud computing
Rethinking DRAM design and organization for energy-constrained multi-cores
Proceedings of the 37th annual international symposium on Computer architecture
IVEC: off-chip memory integrity protection for both security and reliability
Proceedings of the 37th annual international symposium on Computer architecture
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Distributed Diskless Checkpoint for Large Scale Systems
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
End-to-end data integrity for file systems: a ZFS case study
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Tolerating file-system mistakes with EnvyFS
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference
A realistic evaluation of memory hardware errors and software system susceptibility
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
Reliable data-center scale computations
Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware
Wimpy node clusters: what about non-wimpy workloads?
Proceedings of the Sixth International Workshop on Data Management on New Hardware
DRAM errors in the wild: a large-scale field study
Communications of the ACM
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Availability in globally distributed storage systems
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Stealth works: emulating memory errors
RV'10 Proceedings of the First international conference on Runtime verification
Cycles, cells and platters: an empirical analysisof hardware failures on a million consumer PCs
Proceedings of the sixth conference on Computer systems
Paxos replicated state machines as the basis of a high-performance data store
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Considering GPGPU for HPC centers: is it worth the effort?
Facing the multicore-challenge
Aspects of data-intensive cloud computing
From active data management to event-based systems and more
Considering GPGPU for HPC centers: is it worth the effort?
Facing the multicore-challenge
Warding off the dangers of data corruption with amulet
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Sampling + DMR: practical and low-overhead permanent fault detection
Proceedings of the 38th annual international symposium on Computer architecture
Adaptive granularity memory systems: a tradeoff between storage efficiency and throughput
Proceedings of the 38th annual international symposium on Computer architecture
Understanding network failures in data centers: measurement, analysis, and implications
Proceedings of the ACM SIGCOMM 2011 conference
Modeling and synthesizing task placement constraints in Google compute clusters
Proceedings of the 2nd ACM Symposium on Cloud Computing
FTI: high performance fault tolerance interface for hybrid systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Modeling and tolerating heterogeneous failures in large parallel systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Towards reliable storage systems
Towards reliable storage systems
Proactive process-level live migration and back migration in HPC environments
Journal of Parallel and Distributed Computing
Improving System Energy Efficiency with Memory Rank Subsetting
ACM Transactions on Architecture and Code Optimization (TACO)
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Foundations and Trends in Databases
The search for energy-efficient building blocks for the data center
ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
A tunable, software-based DRAM error detection and correction library for HPC
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Fault-tolerant complex event processing using customizable state machine-based operators
Proceedings of the 15th International Conference on Extending Database Technology
VM aware journaling: improving journaling file system performance in virtualization environments
Software—Practice & Experience
BOOM: enabling mobile memory based low-power server DIMMs
Proceedings of the 39th Annual International Symposium on Computer Architecture
Proceedings of the 39th Annual International Symposium on Computer Architecture
LOT-ECC: localized and tiered reliability mechanisms for commodity memory systems
Proceedings of the 39th Annual International Symposium on Computer Architecture
Practical hardening of crash-tolerant systems
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Fast online error detection and correction with thread signature calculae
Microprocessors & Microsystems
Software execution protection in the cloud
Proceedings of the 1st European Workshop on Dependable Cloud Computing
MAGE: adaptive granularity and ECC for resilient and power efficient memory systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A study of DRAM failures in the field
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Detection and correction of silent data corruption for large-scale high-performance computing
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
CACTI-IO: CACTI with off-chip power-area-timing models
Proceedings of the International Conference on Computer-Aided Design
Programming model extensions for resilience in extreme scale computing
Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Augustus: scalable and robust storage for cloud applications
Proceedings of the 8th ACM European Conference on Computer Systems
Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems
Server-class DDR3 SDRAM memory buffer chip
IBM Journal of Research and Development
Robustness in the Salus scalable block store
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Tri-level-cell phase change memory: toward an efficient and reliable memory system
Proceedings of the 40th Annual International Symposium on Computer Architecture
Using dark fiber to displace diesel generators
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Datacenter Scale Evaluation of the Impact of Temperature on Hard Disk Drive Failures
ACM Transactions on Storage (TOS)
Accurate and effective algorithm for estimating the reliability of digital combinational circuits
Proceedings of the 46th Annual Simulation Symposium
Exploring DRAM organizations for energy-efficient and resilient exascale memories
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Low-power, low-storage-overhead chipkill correct via multi-line error correction
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Supercomputing with commodity CPUs: are mobile SoCs ready for HPC?
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Limplock: understanding the impact of limpware on scale-out cloud systems
Proceedings of the 4th annual Symposium on Cloud Computing
When the network crumbles: an empirical study of cloud network failures and their impact on services
Proceedings of the 4th annual Symposium on Cloud Computing
Self-stabilizing iterative solvers
ScalA '13 Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
ACM Transactions on Architecture and Code Optimization (TACO)
Ffsck: The Fast File-System Checker
ACM Transactions on Storage (TOS)
Efficient online memory error assessment and circumvention for Linux with RAMpage
International Journal of Critical Computer-Based Systems
Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems
Scientific Programming - Selected Papers from Super Computing 2012
Ffsck: the fast file system checker
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
HARDFS: hardening HDFS with selective and lightweight versioning
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Hi-index | 0.02 |
Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field, when taking all other factors into account. Finally, unlike commonly feared, we don't observe any indication that newer generations of DIMMs have worse error behavior.