As HPC system sizes grow to millions of cores and chip feature sizes continue to decrease, HPC applications become increasingly exposed to transient hardware faults. These faults can cause aborts and performance degradation. Most importantly, they can corrupt results. Thus, we must evaluate the fault vulnerability of key HPC algorithms to develop cost-effective techniques to improve application resilience. We present an approach that analyzes the vulnerability of applications to faults, systematically reduces it by protecting the most vulnerable components, and predicts application vulnerability at large scales. We initially focus on sparse scientific applications and, in this paper, apply our approach to the Algebraic Multigrid (AMG) algorithm. We empirically analyze AMG's vulnerability to hardware faults in both sequential and parallel (hybrid MPI/OpenMP) executions on up to 1,600 cores, and we propose and evaluate targeted pointer replication to reduce that vulnerability. Our techniques increase AMG's resilience to transient hardware faults by 50-80% and improve its scalability in faulty computational environments by 35%. Further, we show how to model AMG's scalability in fault-prone environments to predict execution times of large-scale runs accurately.
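The abstract does not say how targeted pointer replication is implemented, so the following is only a minimal sketch of the general idea under stated assumptions: each protected pointer is stored three times, and a majority vote on every load masks a single corrupted copy. The type `replicated_ptr` and the helpers `rp_store`, `rp_load`, and `rp_flip_bit` are hypothetical names introduced for illustration. The choice of three copies with voting is likewise an assumption, not necessarily the scheme evaluated in the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: three redundant copies of one pointer.
 * Triple redundancy is an assumption made for this sketch. */
typedef struct {
    double *copy[3];
} replicated_ptr;

/* Store the same pointer value in all three copies. */
static void rp_store(replicated_ptr *rp, double *p) {
    rp->copy[0] = p;
    rp->copy[1] = p;
    rp->copy[2] = p;
}

/* Load by majority vote. If at least two copies agree, rewrite all three
 * with the agreed value (repairing the outvoted copy) and return it.
 * If all three copies differ, return NULL to signal an unrecoverable fault. */
static double *rp_load(replicated_ptr *rp) {
    double *a = rp->copy[0];
    double *b = rp->copy[1];
    double *c = rp->copy[2];
    double *winner;

    if (a == b || a == c) {
        winner = a;
    } else if (b == c) {
        winner = b;
    } else {
        return NULL;
    }
    rp_store(rp, winner);
    return winner;
}

/* Test helper: emulate a transient fault by flipping one bit in one copy.
 * The corrupted value is only compared, never dereferenced. */
static void rp_flip_bit(replicated_ptr *rp, int which, unsigned bit) {
    assert(which >= 0 && which < 3);
    assert(bit < 8 * sizeof(uintptr_t));
    uintptr_t raw = (uintptr_t)rp->copy[which];
    raw ^= ((uintptr_t)1 << bit);
    rp->copy[which] = (double *)raw;
}
```

In a sketch like this, code that would dereference a raw pointer calls `rp_load` first and treats a `NULL` result as a detected fault. "Targeted" then means wrapping only the pointers that a vulnerability analysis identifies as most critical, which keeps the extra memory and voting cost small.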