Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods

Authors: Zizhong Chen
Affiliations: University of California, Riverside, USA
Venue: Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year: 2013

Abstract

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Large supercomputers are especially susceptible to soft errors because of their large number of components. Soft errors can generally be detected offline by comparing the final results of two duplicated computations, but this approach often introduces significant overhead. This paper presents Online-ABFT, a simple but efficient online soft error detection technique for the widely used Krylov subspace iterative methods. Online-ABFT detects soft errors in the middle of program execution, so a corrupted computation can be terminated soon after a soft error occurs, which improves computational efficiency. Because it is based on a simple verification of orthogonality and residual, Online-ABFT is easy to implement and highly efficient. Experimental results demonstrate that, when this online error detection approach is used together with checkpointing, it improves the time to obtain correct results by up to several orders of magnitude over the traditional offline approach.
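
The abstract does not spell out which relations are verified, so the following is only a minimal sketch of what an orthogonality and residual check might look like in one Krylov method, the conjugate gradient (CG) method. It assumes two invariants that hold for CG in exact arithmetic: the recursively updated residual r equals b - Ax, and each new search direction is A-conjugate to the previous one. The function name `cg_with_checks`, the parameters `check_every`, `check_tol`, and `inject_at`, and the threshold values are all hypothetical and are not taken from the paper.

```python
import math

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cg_with_checks(A, b, tol=1e-10, max_iter=100, check_every=1,
                   check_tol=1e-8, inject_at=None):
    """CG for a symmetric positive definite A with periodic invariant checks.

    Returns (x, iterations, detected_at); detected_at is None when no
    check failed. inject_at (hypothetical) simulates a soft error by
    corrupting the matrix-vector product at that iteration.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual for the initial guess x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    b_norm = norm(b)
    for k in range(1, max_iter + 1):
        q = matvec(A, p)
        if inject_at == k:
            q[0] += 1.0      # simulated soft error in q = A p
        alpha = rs_old / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p_new = [ri + beta * pi for ri, pi in zip(r, p)]
        converged = math.sqrt(rs_new) <= tol * b_norm
        if converged or k % check_every == 0:
            # Residual check: the recursive r should match b - A x.
            true_r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
            gap = norm([t - ri for t, ri in zip(true_r, r)]) / b_norm
            if gap > check_tol:
                return x, k, k
            # Orthogonality check: p_new . (A p) should be close to zero.
            # Skipped at convergence, where p_new is dominated by rounding.
            denom = norm(p_new) * norm(q)
            if not converged and denom > 0.0:
                if abs(dot(p_new, q)) / denom > check_tol:
                    return x, k, k
        if converged:
            return x, k, None
        p, rs_old = p_new, rs_new
    return x, max_iter, None

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
print(cg_with_checks(A, b)[2])               # no fault injected
print(cg_with_checks(A, b, inject_at=2)[2])  # fault injected at iteration 2
```

In this sketch a failed check simply stops the solver and reports the iteration, which a caller could use as the signal to roll back to a checkpoint. A larger `check_every` lowers the checking cost, since the residual check needs an extra matrix-vector product, at the price of later detection.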