The Reliability of Single-Error Protected Computer Memories
IEEE Transactions on Computers
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
FERRARI: A Flexible Software-Based Fault and Error Injection System
IEEE Transactions on Computers - Special issue on fault-tolerant computing
IBM experiments in soft fails in computer electronics (1978–1994)
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Field testing for cosmic ray soft errors in semiconductor memories
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Accelerated testing for cosmic soft-error rate
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Improving cluster availability using workstation validation
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers
IEEE Transactions on Software Engineering
Impact of Deep Submicron Technology on Dependability of VLSI Circuits
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Networked Windows NT System Field Failure Data Analysis
PRDC '99 Proceedings of the 1999 Pacific Rim International Symposium on Dependable Computing
Using Memory Errors to Attack a Virtual Machine
SP '03 Proceedings of the 2003 IEEE Symposium on Security and Privacy
Basic Concepts and Taxonomy of Dependable and Secure Computing
IEEE Transactions on Dependable and Secure Computing
Susceptibility of Commodity Systems and Software to Memory Soft Errors
IEEE Transactions on Computers
Automating Software Failure Reporting
Queue - System Failures
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Assessing Fault Sensitivity in MPI Applications
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Soft Errors in Advanced Computer Systems
IEEE Design & Test
Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Enhancing server availability and security through failure-oblivious computing
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Reliability and Performance of Error-Correcting Memory and Register Arrays
IEEE Transactions on Computers
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Samurai: protecting critical data in unsafe languages
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
A memory soft error measurement on production systems
ATC'07 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference
Reference-driven performance anomaly identification
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
DRAM errors in the wild: a large-scale field study
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
IBM Journal of Research and Development
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Stealth works: emulating memory errors
RV'10 Proceedings of the First international conference on Runtime verification
Modeling and tolerating heterogeneous failures in large parallel systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
System implications of memory reliability in exascale computing
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Cooperative Application/OS DRAM fault recovery
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
LOT-ECC: localized and tiered reliability mechanisms for commodity memory systems
Proceedings of the 39th Annual International Symposium on Computer Architecture
Practical hardening of crash-tolerant systems
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
MAGE: adaptive granularity and ECC for resilient and power efficient memory systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A study of DRAM failures in the field
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Towards transparent hardening of distributed systems
Proceedings of the 9th Workshop on Hot Topics in Dependable Systems
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Memory hardware reliability is an indispensable part of whole-system dependability. This paper presents the collection of realistic memory hardware error traces (including transient and non-transient errors) from production computer systems with more than 800GB memory for around nine months. Detailed information on the error addresses allows us to identify patterns of single-bit, row, column, and whole-chip memory errors. Based on the collected traces, we explore the implications of different hardware ECC protection schemes so as to identify the most common error causes and approximate error rates exposed to the software level. Further, we investigate the software system susceptibility to major error causes, with the goal of validating, questioning, and augmenting results of prior studies. In particular, we find that the earlier result that most memory hardware errors do not lead to incorrect software execution may not be valid, due to the unrealistic model of exclusive transient errors. Our study is based on an efficient memory error injection approach that applies hardware watchpoints on hotspot memory regions.