A realistic evaluation of memory hardware errors and software system susceptibility

Authors:
Xin Li; Michael C. Huang; Kai Shen; Lingkun Chu
Affiliations:
University of Rochester; University of Rochester; University of Rochester; Ask.com
Venue:
USENIX ATC '10: Proceedings of the 2010 USENIX Annual Technical Conference
Year:
2010

Abstract

Memory hardware reliability is an indispensable part of whole-system dependability. This paper presents the collection of realistic memory hardware error traces (including transient and non-transient errors) from production computer systems with more than 800 GB of memory over around nine months. Detailed information on the error addresses allows us to identify patterns of single-bit, row, column, and whole-chip memory errors. Based on the collected traces, we explore the implications of different hardware ECC protection schemes in order to identify the most common error causes and the approximate error rates exposed to the software level. Further, we investigate software system susceptibility to the major error causes, with the goal of validating, questioning, and augmenting the results of prior studies. In particular, we find that the earlier result that most memory hardware errors do not lead to incorrect software execution may not be valid, because it rests on an unrealistic model that assumes exclusively transient errors. Our study is based on an efficient memory error injection approach that applies hardware watchpoints on hotspot memory regions.
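
The abstract's step of identifying single-bit, row, column, and whole-chip patterns from error addresses can be illustrated with a minimal sketch. This is not the authors' tooling: the `ErrorRecord` fields, the function names, and the grouping heuristics are all assumptions made for illustration, and a real trace would first need its physical addresses decoded into chip, row, and column coordinates.

```python
from collections import defaultdict
from typing import NamedTuple


class ErrorRecord(NamedTuple):
    """Hypothetical decoded error location (not the paper's trace format)."""
    chip: int
    row: int
    col: int
    bit: int


def classify_chip(records):
    """Label the error pattern observed on one chip (assumed heuristic)."""
    rows = {r.row for r in records}
    cols = {r.col for r in records}
    cells = {(r.row, r.col, r.bit) for r in records}
    if len(cells) == 1:
        return "single-bit"   # every error hits the same bit
    if len(rows) == 1 and len(cols) > 1:
        return "row"          # one row, several columns
    if len(cols) == 1 and len(rows) > 1:
        return "column"       # one column, several rows
    if len(rows) > 1 and len(cols) > 1:
        return "whole-chip"   # crude: errors spread over rows and columns
    return "other"            # same cell, several distinct bits


def classify_trace(trace):
    """Group records by chip and classify each group."""
    by_chip = defaultdict(list)
    for rec in trace:
        by_chip[rec.chip].append(rec)
    return {chip: classify_chip(recs) for chip, recs in by_chip.items()}


trace = [
    ErrorRecord(0, 5, 7, 3), ErrorRecord(0, 5, 7, 3),
    ErrorRecord(1, 2, 1, 0), ErrorRecord(1, 2, 9, 4),
    ErrorRecord(2, 1, 6, 0), ErrorRecord(2, 8, 6, 0),
    ErrorRecord(3, 1, 1, 0), ErrorRecord(3, 2, 2, 0),
]
result = classify_trace(trace)
```

With this sample trace, chips 0 through 3 are labeled single-bit, row, column, and whole-chip respectively. The whole-chip rule in particular is deliberately crude; a practical classifier would likely require many more distinct locations before drawing that conclusion.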