We show how to use the idea of self-stabilization, which originates in the context of distributed control, to construct fault-tolerant iterative solvers. Generally, a self-stabilizing system is one that, starting from an arbitrary state (valid or invalid), reaches a valid state within a finite number of steps. This property gives the system a natural means of tolerating transient faults. We give two proof-of-concept examples of self-stabilizing iterative linear solvers: one for steepest descent (SD) and one for conjugate gradients (CG). Our self-stabilized versions of SD and CG require only a small amount of fault detection; for example, we may check only for NaNs and infinities. We test our approach experimentally by analyzing its convergence and overhead for different types and rates of faults. Beyond the specific findings of this paper, we believe self-stabilization is a promising tool for constructing resilient solvers more generally.
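To make the idea concrete, here is a minimal sketch of a steepest-descent loop that tolerates transient faults in the iterate. It is an illustration under stated assumptions, not the algorithm from the paper: the function name `steepest_descent`, the `fault` injection hook, and the reset-to-zero policy are all hypothetical. The sketch assumes the matrix is symmetric positive definite, and that faults corrupt only the iterate `x`, not `A` or `b`. Because the residual is recomputed from `x` on every iteration, any finite `x` is a valid state from which the method converges; the only detection performed is a check for NaNs and infinities.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def steepest_descent(A, b, tol=1e-10, max_iter=10_000, fault=None):
    """Steepest descent for a symmetric positive definite A.

    `fault` is a hypothetical hook, fault(k, x), that may corrupt x in
    place at iteration k to simulate a transient fault.
    """
    n = len(b)
    x = [0.0] * n
    for k in range(max_iter):
        if fault is not None:
            fault(k, x)  # simulated transient fault (assumption)

        # Minimal fault detection: only NaN/Inf are caught here.
        if not all(math.isfinite(v) for v in x):
            x = [0.0] * n  # any finite vector is a valid restart state

        # Recompute the residual from the current iterate instead of
        # updating it recurrently, so a corrupted x acts like a new
        # starting guess rather than a persistent inconsistency.
        r = [bi - ci for bi, ci in zip(b, matvec(A, x))]
        rr = dot(r, r)
        if math.sqrt(rr) <= tol:
            return x, k

        alpha = rr / dot(r, matvec(A, r))  # exact line search step
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return x, max_iter


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]

    def inject(k, x):
        if k == 3:
            x[0] = 1.0e6         # large but finite corruption, undetected
        if k == 5:
            x[1] = float("nan")  # non-finite corruption, detected

    x, iters = steepest_descent(A, b, fault=inject)
    print(x, iters)
```

The finite corruption at iteration 3 passes the check unnoticed and is simply absorbed as a worse starting guess, while the NaN at iteration 5 triggers a reset. Recomputing the residual costs one extra matrix-vector product per iteration in this sketch; a practical variant would likely do so only periodically.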