It is widely understood that most system downtime is accounted for by programming errors and administration time. However, a growing body of work indicates that an increasing share of downtime may stem from transient errors in computer system hardware caused by external factors such as cosmic rays. That work also indicates that moving to denser semiconductor technologies at lower voltages has the potential to increase these transient errors. In this paper, we investigate the susceptibility of commodity operating systems and applications running on commodity PC processors to these soft errors, and we introduce ideas for improving software recovery from them. Our results indicate that, for the Linux kernel and a Java virtual machine running sample workloads, many errors are never activated, mostly because they are overwritten. In addition, given current and upcoming microprocessor support, our results indicate that errors that are activated, and that would normally lead to a system reboot, need not be fatal to the system if software knowledge is used for simple software recovery. Together, these results indicate the benefits of simple recovery handling for memory soft errors in commodity processors and software.