Fault tolerant matrix operations using checksum and reverse computation

  • Authors: Youngbae Kim; J. S. Plank; J. J. Dongarra

  • Venue: FRONTIERS '96: Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
  • Year: 1996

Abstract

In this paper, we present a technique, based on checksum and reverse computation, that enables high-performance matrix operations to be fault-tolerant with low overhead. We have implemented this technique on five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization and Hessenberg reduction. The overhead of checkpointing and recovery is analyzed both theoretically and experimentally. These analyses confirm that our technique can provide fault tolerance for these high-performance matrix operations with low overhead.
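
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each processor holds one block of data, an extra processor holds an element-wise checksum taken at the last checkpoint, and every update applied since then can be undone exactly. All names (`make_checksum`, `apply_update`, `reverse_update`, `recover_block`) are hypothetical, and integer data is used so that the reversal is exact.

```python
def make_checksum(blocks):
    """Element-wise sum of all blocks, as held by a (hypothetical) checksum processor."""
    return [sum(vals) for vals in zip(*blocks)]


def apply_update(block, delta):
    """Forward computation step: add an update to a block."""
    return [b + d for b, d in zip(block, delta)]


def reverse_update(block, delta):
    """Reverse computation: undo apply_update to return to the checkpointed state."""
    return [b - d for b, d in zip(block, delta)]


def recover_block(blocks, checksum, failed):
    """Rebuild the failed block from the checksum and the surviving blocks.

    Assumes the survivors are already back in their checkpointed state
    and that at least one block survives.
    """
    survivors = [b for i, b in enumerate(blocks) if i != failed]
    return [c - sum(vals) for c, vals in zip(checksum, zip(*survivors))]


# Checkpoint: three processors, each holding one block.
blocks = [[1, 2], [3, 4], [5, 6]]
checksum = make_checksum(blocks)

# Computation proceeds past the checkpoint.
deltas = [[10, 20], [30, 40], [50, 60]]
blocks = [apply_update(b, d) for b, d in zip(blocks, deltas)]

# Processor 1 fails and loses its block.
failed = 1
blocks[failed] = None

# Survivors reverse their updates, then the lost block is rebuilt.
for i in range(len(blocks)):
    if i != failed:
        blocks[i] = reverse_update(blocks[i], deltas[i])
blocks[failed] = recover_block(blocks, checksum, failed)
```

In this sketch the checkpoint costs only one extra block of storage, and recovery needs the surviving data plus the ability to run the updates backwards. With floating-point data the reversal would generally be subject to rounding error, which this integer example sidesteps.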