IBM experiments in soft fails in computer electronics (1978–1994)
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Field testing for cosmic ray soft errors in semiconductor memories
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Accelerated testing for cosmic soft-error rate
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Availability and utility of idle memory in workstation clusters
SIGMETRICS '99 Proceedings of the 1999 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Progress-based regulation of low-importance processes
Proceedings of the seventeenth ACM symposium on Operating systems principles
Information and control in gray-box systems
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Susceptibility of Commodity Systems and Software to Memory Soft Errors
IEEE Transactions on Computers
Assessing Fault Sensitivity in MPI Applications
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Transparent contribution of memory
ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
An empirical study of memory hardware errors in a server farm
HotDep'07 Proceedings of the 3rd workshop on on Hot Topics in System Dependability
DRAM errors in the wild: a large-scale field study
Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
End-to-end data integrity for file systems: a ZFS case study
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
A realistic evaluation of memory hardware errors and software system susceptibility
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
DRAM errors in the wild: a large-scale field study
Communications of the ACM
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Lossless fault-tolerant data structures with additive overhead
WADS'11 Proceedings of the 12th international conference on Algorithms and data structures
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
A study of DRAM failures in the field
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Preserving Hamming Distance in Arithmetic and Logical Operations
Journal of Electronic Testing: Theory and Applications
Hi-index | 0.02 |
Memory state can be corrupted by the impact of particles causing single-event upsets (SEUs). Understanding and dealing with these soft (or transient) errors is important for system reliability. Several earlier studies have provided field test measurement results on memory soft error rate, but no results were available for recent production computer systems. We believe the measurement results on real production systems are uniquely valuable due to various environmental effects. This paper presents methodologies for memory soft error measurement on production systems where performance impact on existing running applications must be negligible and the system administrative control might or might not be available. We conducted measurements in three distinct system environments: a rack-mounted server farm for a popular Internet service (Ask.com search engine), a set of office desktop computers (Univ. of Rochester), and a geographically distributed network testbed (PlanetLab). Our preliminary measurement on over 300 machines for varying multi-month periods finds 2 suspected soft errors. In particular, our result on the Internet servers indicates that, with high probability, the soft error rate is at least two orders of magnitude lower than those reported previously. We provide discussions that attribute the low error rate to several factors in today's production system environments. As a contrast, our measurement unintentionally discovers permanent (or hard) memory faults on 9 out of 212 Ask.com machines, suggesting the relative commonness of hard memory faults.