In scientific applications that involve dense matrices, checksum encodings have yielded "algorithm-based fault tolerance" (ABFT) against data corruption from either hard or transient (soft) hardware errors. However, such checksum-based ABFT techniques have not been developed for sparse matrices, for example, for the solution of sparse linear systems by a method such as preconditioned conjugate gradients (PCG). In this paper, we develop a new sparse checksum-encoded, algorithm-based fault-tolerant PCG, S-ABFT-PCG. Our checksum-based approach can be applied to all the key operations in PCG, including sparse matrix-vector multiplication (SpMV), vector operations, and the application of a preconditioner through sparse triangular solution. We prove that our approach detects a single error in the matrix and vector elements and in the metadata representing the sparse matrix row or column indices, when the coefficient matrix of the linear system is symmetric positive definite and strictly diagonally dominant. The overhead of S-ABFT-PCG is proportional to the cost of a few O(n) vector operations, which is low relative to the total cost of a PCG iteration with an SpMV and two triangular solutions. However, if an error is detected, the underlying PCG iteration must be recomputed, because our approach does not enable checksum-encoded recovery from the error. We compare S-ABFT-PCG with a classical ABFT-PCG (C-ABFT-PCG) that detects and recovers from a single error in the SpMV kernel but does not provide fault tolerance for the sparse triangular solution kernel. Our experimental results indicate that, when no errors occur, the overhead of S-ABFT-PCG relative to a PCG with no ABFT is 11.3%, lower than the 23.1% overhead of C-ABFT-PCG. Furthermore, in the event of a single error in the application of the preconditioner through triangular solution, C-ABFT-PCG suffers from significant increases in iteration counts, leading to performance degradations of 63.2% on average, compared to 3.2% on average for S-ABFT-PCG.
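To illustrate the general idea of checksum-based error detection in an SpMV, the following is a minimal sketch of the textbook column-checksum test: if c^T = 1^T A, then for y = Ax the sum of the entries of y should equal c^T x. It is not the S-ABFT-PCG encoding itself, which the abstract does not specify. The CSR layout, the function names (`csr_spmv`, `column_checksums`, `checked_spmv`), and the tolerance are all assumptions made for illustration.

```python
import math

def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y

def column_checksums(values, col_idx, n_cols):
    """Return c with c[j] = sum over i of A[i][j], i.e. c^T = 1^T A."""
    c = [0.0] * n_cols
    for v, j in zip(values, col_idx):
        c[j] += v
    return c

def checked_spmv(values, col_idx, row_ptr, x, c, rel_tol=1e-9):
    """Run the SpMV, then compare sum(y) against the checksum product c^T x.

    Returns (y, ok); ok is False when the two sums disagree beyond the
    (assumed, illustrative) floating-point tolerance.
    """
    y = csr_spmv(values, col_idx, row_ptr, x)
    expected = sum(cj * xj for cj, xj in zip(c, x))
    ok = math.isclose(sum(y), expected, rel_tol=rel_tol, abs_tol=1e-12)
    return y, ok

# Hypothetical 3x3 symmetric, strictly diagonally dominant test matrix:
# [[4, 1, 0],
#  [1, 4, 1],
#  [0, 1, 4]]
values = [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
x = [1.0, 2.0, 3.0]

# Checksums are computed once, while the matrix is assumed to be fault-free.
c = column_checksums(values, col_idx, 3)

y, ok = checked_spmv(values, col_idx, row_ptr, x, c)

# Simulate a soft error in one stored matrix value after encoding.
corrupted = list(values)
corrupted[3] = 40.0
y_bad, ok_bad = checked_spmv(corrupted, col_idx, row_ptr, x, c)
```

The check costs one extra dot product and one sum per SpMV, both O(n) vector operations. A single sum of this kind can miss some errors, for instance a corrupted matrix value that multiplies a zero entry of x, so the detection guarantee stated in the abstract should be attributed to the paper's own encoding and assumptions rather than to this sketch.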