Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, although the OS typically occupies only a small fraction of a compute node's physical memory, recent studies show more memory errors in this region than in the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline the major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show that the Kitten lightweight operating system may be an easier target to harden against memory errors because of its smaller memory footprint, largely deterministic state, and simpler system structure.