Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods currently in use, such as rollback recovery, are poorly suited to many of the expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges in order to use future systems effectively. In this paper, we describe work on a cross-layer application/OS framework for handling uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.