Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice

  • Authors:
  • John A. Gunnels;Robert A. van de Geijn;Daniel S. Katz;Enrique S. Quintana-Ortí

  • Affiliations:
  • -;-;-;-

  • Venue:
  • DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
  • Year:
  • 2001

Quantified Score

Hi-index 0.00

Visualization

Abstract

Abstract: In this paper, we extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C = AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.