Fault tolerant preconditioned conjugate gradient for sparse linear system solution

Authors:
Manu Shantharam; Sowmyalatha Srinivasmurthy; Padma Raghavan
Affiliations:
The Pennsylvania State University, State College, PA, USA (all authors)
Venue:
Proceedings of the 26th ACM international conference on Supercomputing
Year:
2012

Abstract

In scientific applications that involve dense matrices, checksum encodings have yielded "algorithm-based fault tolerance" (ABFT) in the event of data corruption from either hard or transient (soft) errors in the hardware. However, such checksum-based ABFT techniques have not been developed for computations that involve sparse matrices, for example, sparse linear system solution through a method such as preconditioned conjugate gradients (PCG). In this paper, we develop a new sparse checksum-encoded, algorithm-based fault-tolerant PCG, S-ABFT-PCG. Our checksum-based approach can be applied to all the key operations in PCG, including sparse matrix-vector multiplication (SpMV), vector operations, and the application of a preconditioner through sparse triangular solution. We prove that our approach detects a single error in the matrix and vector elements and in the metadata representing the sparse matrix row or column indices, when the linear system has a coefficient matrix that is symmetric positive definite and strictly diagonally dominant. The overhead of S-ABFT-PCG is proportional to the cost of a few O(n) vector operations, which is relatively low compared to the total cost of a PCG iteration with an SpMV and two triangular solutions. However, if an error is detected, the underlying PCG iteration must be recomputed because our approach does not enable checksum-encoded recovery from the error. We compare S-ABFT-PCG with a classical ABFT-PCG (C-ABFT-PCG) that detects and recovers from a single error in the SpMV kernel but does not provide fault tolerance for the sparse triangular solution kernel. Our experimental results indicate that when no errors occur, the overhead of S-ABFT-PCG relative to a PCG with no ABFT is 11.3%, lower than the 23.1% overhead of C-ABFT-PCG. Furthermore, in the event of a single error in the application of the preconditioner through triangular solution, C-ABFT-PCG suffers from significant increases in iteration counts, leading to performance degradation of 63.2% on average, compared to 3.2% on average for S-ABFT-PCG.
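
To make the idea of a checksum-protected SpMV concrete, the following is a minimal sketch of the textbook column-checksum test for y = A x, which relies on the identity sum(y) = (1^T A) x. It is not the S-ABFT-PCG encoding from the paper, whose details are not given in the abstract. The CSR layout, the function names (`spmv`, `column_checksum`, `checked_spmv`), and the tolerance `rtol` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical CSR container: (values, column indices, row pointers).
CSR = Tuple[List[float], List[int], List[int]]


def spmv(a: CSR, x: List[float]) -> List[float]:
    """Plain CSR sparse matrix-vector product y = A x."""
    vals, cols, rowptr = a
    n_rows = len(rowptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            acc += vals[k] * x[cols[k]]
        y[i] = acc
    return y


def column_checksum(a: CSR, n_cols: int) -> List[float]:
    """Column sums c[j] = sum_i A[i, j], computed once while A is trusted."""
    vals, cols, _ = a
    c = [0.0] * n_cols
    for v, j in zip(vals, cols):
        c[j] += v
    return c


def checked_spmv(a: CSR, c: List[float], x: List[float],
                 rtol: float = 1e-10) -> Tuple[List[float], bool]:
    """Compute y = A x and test sum(y) against c . x.

    Returns (y, ok). ok is False when the two sides disagree beyond a
    relative tolerance, which signals that A, its indices, or y may be corrupt.
    """
    y = spmv(a, x)
    lhs = sum(y)
    rhs = sum(cj * xj for cj, xj in zip(c, x))
    scale = max(abs(lhs), abs(rhs), 1.0)
    return y, abs(lhs - rhs) <= rtol * scale


# Assumed test matrix: symmetric, strictly diagonally dominant tridiagonal.
# [[4, 1, 0],
#  [1, 4, 1],
#  [0, 1, 4]]
good = ([4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0], [0, 1, 0, 1, 2, 1, 2], [0, 2, 5, 7])
checksum = column_checksum(good, 3)
x = [1.0, 2.0, 3.0]

y_good, ok_good = checked_spmv(good, checksum, x)

# Simulate a single corrupted matrix value (4.0 -> 40.0) after encoding.
bad = ([4.0, 1.0, 1.0, 40.0, 1.0, 1.0, 4.0], [0, 1, 0, 1, 2, 1, 2], [0, 2, 5, 7])
y_bad, ok_bad = checked_spmv(bad, checksum, x)

print(ok_good, ok_bad)
```

The check costs one extra dot product and one vector sum, both O(n), on top of the SpMV. In a solver loop, a caller would recompute the affected iteration when `ok` is False. A single unweighted checksum like this one can miss some errors (for example, a corrupted column index that happens to point at an entry of x with the same value), so it should be read as an illustration of the general mechanism only.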