Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice

Authors:
John A. Gunnels; Robert A. van de Geijn; Daniel S. Katz; Enrique S. Quintana-Ortí
Venue:
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Year:
2001

Abstract

In this paper, we extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C = AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.
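
To make the detect-then-roll-back idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: it assumes the classical checksum test, in which C w is compared with A (B w) for a fixed vector w, and it "rolls back" simply by recomputing the whole product. The names `checksum_residual`, `checked_matmul`, `tol`, and `max_retries` are hypothetical, and the sketch covers only corruption of C, not of A or B.

```python
def matmul(A, B):
    """Plain triple-loop product C = A B for matrices stored as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(inner)) for j in range(cols)]
            for row in A]


def matvec(M, x):
    """Matrix-vector product M x."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def checksum_residual(A, B, C, w):
    """Largest entry of |C w - A (B w)|.

    Both sides cost O(n^2) operations, versus O(n^3) for the product itself,
    which is why a check of this kind adds little overhead.
    """
    lhs = matvec(C, w)
    rhs = matvec(A, matvec(B, w))
    return max(abs(l - r) for l, r in zip(lhs, rhs))


def checked_matmul(A, B, multiply=matmul, tol=1e-9, max_retries=3):
    """Multiply, verify with the checksum test, and recompute on failure.

    Returns (C, retries). `tol` is an assumed threshold; in floating point it
    has to allow for rounding error, which this sketch does not model.
    """
    w = [1.0] * len(B[0])  # all-ones vector, so C w holds the row sums of C
    for retries in range(max_retries + 1):
        C = multiply(A, B)
        if checksum_residual(A, B, C, w) <= tol:
            return C, retries
    raise RuntimeError("checksum test still failing after retries")


# Demo: a multiply routine that corrupts one entry of C on its first call only.
calls = {"n": 0}


def flaky_matmul(A, B):
    C = matmul(A, B)
    calls["n"] += 1
    if calls["n"] == 1:
        C[0][1] += 1000.0  # simulated single-entry soft error
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C, retries = checked_matmul(A, B, multiply=flaky_matmul)
print(C, retries)
```

With w set to all ones, a single corrupted entry of C shifts exactly one row sum by the size of the error, so the test flags it whenever that size exceeds `tol`. Recomputing the full product is the simplest possible recovery; a cheaper roll-back would redo only the part of the computation performed since the last successful check.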