It is commonly believed that most computer system failures today stem from programming errors. As computer systems grow more complex and harder to maintain and administer, software errors become even more common, while contemporary computer architectures are optimized for price and performance rather than availability. In this paper, we make a case for the increasing relevance of soft errors in memory hardware. In particular, with the introduction of 64-bit processors, memory capacity scales up significantly, which raises the probability of memory errors. At the same time, because computers are now used everywhere, including at higher altitudes, environmental conditions such as terrestrial cosmic rays affect error rates. Finally, in shared-memory systems, the failure of one node's memory can take the whole machine down. Current commodity systems do not tolerate memory errors, in either hardware (processors, memories, interconnects) or software (operating systems, applications, application environments). At the same time, users expect increased reliability. We present the problems caused by such errors and some solutions for memory error recovery at the processor, operating system, and programming model levels.