Algorithm-based recovery for HPL

  • Authors:
  • Teresa Davies; Zizhong Chen; Christer Karlsson; Hui Liu

  • Affiliations:
  • Colorado School of Mines, Golden, CO, USA; Colorado School of Mines, Golden, CO, USA; Colorado School of Mines, Golden, CO, USA; Colorado School of Mines, Golden, CO, USA

  • Venue:
  • Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
  • Year:
  • 2011

Abstract

When more processors are used for a calculation, the probability that one will fail during the calculation increases. Fault tolerance is a technique for allowing a calculation to survive a failure, and includes recovering lost data. A common method of recovery is diskless checkpointing. However, it has high overhead when a large amount of data is involved, as is the case with matrix operations. A checksum-based method allows fault tolerance of matrix operations with lower overhead. This technique is applicable to the LU decomposition in the benchmark HPL.
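
The checksum idea in the abstract can be illustrated with a minimal sketch. It assumes a simple sum checksum over equally sized data blocks, one block per process, and the function names `make_checksum` and `recover_block` are hypothetical. It shows only how one lost block can be rebuilt from the checksum and the surviving blocks; it does not show how the paper keeps the checksum valid during the LU decomposition in HPL.

```python
# Hypothetical illustration: each "process" owns one block (a list of floats),
# and one extra process holds the element-wise sum of all blocks.

def make_checksum(blocks):
    """Return the element-wise sum of all data blocks."""
    return [sum(vals) for vals in zip(*blocks)]


def recover_block(blocks, checksum, lost):
    """Rebuild blocks[lost] by subtracting every surviving block from the checksum."""
    recovered = list(checksum)
    for i, block in enumerate(blocks):
        if i == lost:
            continue  # the failed process contributes nothing
        recovered = [r - x for r, x in zip(recovered, block)]
    return recovered


blocks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
checksum = make_checksum(blocks)

blocks[1] = None  # simulate the failure of the process holding block 1
restored = recover_block(blocks, checksum, lost=1)
print(restored)
```

In this sketch the redundancy is a single extra block, whatever the number of processes, so it tolerates one failure at a time. In floating-point arithmetic the recovered values are in general subject to rounding error, which the small integers used here happen to avoid.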