Correcting soft errors online in LU factorization

  • Authors:
  • Teresa Davies; Zizhong Chen

  • Affiliations:
  • Colorado School of Mines, Golden, CO, USA; University of California, Riverside, Riverside, CA, USA

  • Venue:
  • Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing
  • Year:
  • 2013

Abstract

In high-performance systems, the probability of failure increases with the number of processors. Soft errors can corrupt calculations without being detected by outside means. To address this problem, we create a checksum-based approach that detects and recovers from calculation errors. We apply this approach to the LU factorization algorithm used by High Performance Linpack. Our approach has low overhead: during recovery it repeats only a fraction of the calculation, whereas an existing approach must repeat calculations when soft errors occur because the errors have already propagated. Our approach detects and corrects errors during the calculation, before they propagate. The frequency of checking can be adjusted to the error rate, resulting in a flexible method of fault tolerance.
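
The abstract does not spell out the checksum encoding, so the following is only a minimal serial sketch of the general idea, not the authors' algorithm. It assumes no pivoting, at most one corrupted data element per row between checks, and two checksum columns (a plain row sum and a position-weighted row sum) that are carried through the elimination. The names `add_checksums`, `check_and_correct`, `lu_upper_with_checks`, the `fault` tuple, and the absolute tolerance `tol` are all hypothetical and chosen for illustration. Because every row operation is linear, each row should stay consistent with its checksums; a mismatch reveals the size of the error, and the ratio of the two mismatches reveals its column.

```python
def add_checksums(a):
    """Append two checksum columns to each row: plain sum and weighted sum."""
    return [
        list(map(float, row))
        + [float(sum(row)), float(sum((j + 1) * v for j, v in enumerate(row)))]
        for row in a
    ]


def check_and_correct(w, n, tol=1e-9):
    """Verify each row against its checksums and repair a single bad element.

    Assumes the checksum columns themselves are intact and that at most one
    data element per row is corrupted. Returns the (row, col) pairs fixed.
    """
    fixed = []
    for i, row in enumerate(w):
        d1 = sum(row[:n]) - row[n]            # magnitude of the error
        if abs(d1) <= tol:
            continue
        d2 = sum((j + 1) * row[j] for j in range(n)) - row[n + 1]
        col = round(d2 / d1) - 1              # weighted/plain ratio gives the column
        if not 0 <= col < n:
            raise ValueError(f"row {i}: error cannot be located")
        row[col] -= d1                        # undo the corruption in place
        fixed.append((i, col))
    return fixed


def lu_upper_with_checks(a, check_every=1, fault=None):
    """Gaussian elimination without pivoting on a checksum-augmented matrix.

    check_every: number of elimination steps between checks.
    fault: optional (step, row, col, delta) used only to simulate a soft error
           injected right after the given elimination step.
    Returns (U, corrections).
    """
    n = len(a)
    w = add_checksums(a)
    corrections = []
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = w[i][k] / w[k][k]
            # Update data and checksum columns together so they stay consistent.
            for j in range(k, n + 2):
                w[i][j] -= m * w[k][j]
        if fault is not None and fault[0] == k:
            _, r, c, delta = fault
            w[r][c] += delta                  # simulated soft error
        if (k + 1) % check_every == 0:
            corrections += check_and_correct(w, n)
    corrections += check_and_correct(w, n)    # final check before returning
    return [row[:n] for row in w], corrections
```

For example, calling `lu_upper_with_checks([[2, 1, 1], [4, 3, 3], [8, 7, 9]], fault=(0, 2, 2, 5.0))` injects an error into the trailing matrix after the first elimination step; with `check_every=1` the check that follows repairs it before the next step uses the corrupted value. Raising `check_every` lowers the checking cost but lets an error spread further before it is caught, which is the trade-off the abstract describes as adjusting the checking frequency to the error rate.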