Fault tolerant preconditioned conjugate gradient for sparse linear system solution

  • Authors:
  • Manu Shantharam;Sowmyalatha Srinivasmurthy;Padma Raghavan

  • Affiliations:
  • The Pennsylvania State University, State College, PA, USA;The Pennsylvania State University, State College, PA, USA;The Pennsylvania State University, State College, PA, USA

  • Venue:
  • Proceedings of the 26th ACM international conference on Supercomputing
  • Year:
  • 2012

Abstract

In scientific applications that involve dense matrices, checksum encodings have yielded "algorithm-based fault tolerance" (ABFT) in the event of data corruption from either hard or transient (soft) errors in the hardware. However, such checksum-based ABFT techniques have not been developed for computations that involve sparse matrices, for example, sparse linear system solution through a method such as preconditioned conjugate gradients (PCG). In this paper, we develop a new sparse checksum-encoded, algorithm-based fault tolerant PCG, S-ABFT-PCG. Our checksum-based approach can be applied to all the key operations in PCG, including sparse matrix-vector multiplication (SpMV), vector operations, and the application of a preconditioner through sparse triangular solution. We prove that our approach detects a single error in the matrix and vector elements and in the metadata representing the sparse matrix row or column indices, when the linear system has a coefficient matrix that is symmetric positive definite and strictly diagonally dominant. The overhead of S-ABFT-PCG is proportional to the cost of a few O(n) vector operations, which is relatively low compared to the total cost of a PCG iteration with an SpMV and two triangular solutions. However, if an error is detected, the underlying PCG iteration must be recomputed, because our approach does not enable checksum-encoded recovery from the error. We compare our S-ABFT-PCG with a classical ABFT-PCG (C-ABFT-PCG) that detects and recovers from a single error in the SpMV kernel, but does not provide fault tolerance for the sparse triangular solution kernel. Our experimental results indicate that in the event of no errors, compared to a PCG with no ABFT, the overhead of S-ABFT-PCG is 11.3%, lower than the 23.1% overhead of C-ABFT-PCG.
Furthermore, in the event of a single error in the application of the preconditioner through triangular solution, C-ABFT-PCG suffers from significant increases in iteration counts, leading to performance degradations of 63.2% on average compared to 3.2% on average for S-ABFT-PCG.
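
The abstract does not spell out the S-ABFT-PCG encoding itself, so the following is only a minimal sketch of the general checksum idea behind ABFT for an SpMV, not the authors' algorithm. It assumes a CSR layout and uses the textbook invariant sum(y) = (1^T A) x, where the column-sum vector c = 1^T A is computed once from trusted data. All function names, the tolerance, and the 3x3 test matrix are illustrative assumptions. The check costs O(n) per SpMV, and on a mismatch the operation would simply be recomputed, matching the detect-then-recompute behavior described above.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def column_checksum(values, col_idx, n_cols):
    """Return c with c[j] = sum over i of A[i][j], i.e. c = 1^T A."""
    c = [0.0] * n_cols
    for v, j in zip(values, col_idx):
        c[j] += v
    return c


def spmv_passes_check(y, c, x, tol=1e-9):
    """Test the invariant sum(y) == c . x up to a relative tolerance."""
    lhs = sum(y)
    rhs = sum(cj * xj for cj, xj in zip(c, x))
    scale = max(1.0, abs(lhs), abs(rhs))
    return abs(lhs - rhs) <= tol * scale


# Illustrative SPD, strictly diagonally dominant matrix (an assumption):
# [[ 4, -1,  0],
#  [-1,  4, -1],
#  [ 0, -1,  4]]
values = [4, -1, -1, 4, -1, -1, 4]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
x = [1.0, 2.0, 3.0]

# Checksum is computed once, before any fault is injected.
c = column_checksum(values, col_idx, 3)

y_clean = csr_spmv(values, col_idx, row_ptr, x)
clean_ok = spmv_passes_check(y_clean, c, x)

# Simulated error in a matrix element (diagonal entry of row 1).
bad_values = list(values)
bad_values[3] = 5
y_bad_value = csr_spmv(bad_values, col_idx, row_ptr, x)
value_error_ok = spmv_passes_check(y_bad_value, c, x)

# Simulated error in the metadata (a corrupted column index).
bad_col_idx = list(col_idx)
bad_col_idx[0] = 2
y_bad_index = csr_spmv(values, bad_col_idx, row_ptr, x)
index_error_ok = spmv_passes_check(y_bad_index, c, x)

print(clean_ok, value_error_ok, index_error_ok)
```

This single-checksum test can miss some faults, for instance a corrupted column index that points at an entry of x with the same value as the intended one. The detection guarantee stated in the abstract relies on the paper's own encoding and on the symmetric positive definite, strictly diagonally dominant assumptions, none of which this sketch reproduces.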