Abstract: In this paper, we extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C = AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault tolerance.
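To make the idea of checksum-based error detection for C = AB concrete, the following is a minimal sketch of the classical row/column checksum test, not the specific methods proposed in the paper. It assumes the all-ones vector e as the checksum vector, covers only corruption of a single entry of C (not of A or B), and uses a fixed absolute tolerance; the function names and the `tol` parameter are hypothetical and chosen for illustration.

```python
def matmul(A, B):
    """Plain triple-loop product of two list-of-lists matrices."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def checksum_residuals(A, B, C):
    """Return (row_res, col_res) for the claimed product C = A B.

    row_res[i] = (C e)[i]   - (A (B e))[i]
    col_res[j] = (e^T C)[j] - ((e^T A) B)[j]
    where e is the all-ones vector (an assumption of this sketch).
    Both are zero, up to rounding, when C really equals A B.
    """
    n, k, m = len(A), len(B), len(B[0])
    Be = [sum(B[p]) for p in range(k)]                        # B e
    ABe = [sum(A[i][p] * Be[p] for p in range(k)) for i in range(n)]
    eA = [sum(A[i][p] for i in range(n)) for p in range(k)]   # e^T A
    eAB = [sum(eA[p] * B[p][j] for p in range(k)) for j in range(m)]
    row_res = [sum(C[i]) - ABe[i] for i in range(n)]
    col_res = [sum(C[i][j] for i in range(n)) - eAB[j] for j in range(m)]
    return row_res, col_res


def locate_single_error(A, B, C, tol=1e-9):
    """Return (i, j) of a single corrupted entry of C, or None if checks pass.

    tol is a hypothetical absolute threshold; real floating-point code
    would need a bound that scales with the matrix sizes and norms.
    """
    row_res, col_res = checksum_residuals(A, B, C)
    bad_rows = [i for i, r in enumerate(row_res) if abs(r) > tol]
    bad_cols = [j for j, c in enumerate(col_res) if abs(c) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("residual pattern is not a single corrupted entry of C")


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    C = matmul(A, B)
    print(locate_single_error(A, B, C))   # no corruption injected
    C[1][0] += 3                          # simulate a soft error in C
    print(locate_single_error(A, B, C))   # flags row 1, column 0
```

The two checksum products cost O(n^2) work each, against O(n^3) for the multiplication itself, which is why such tests can be low-overhead. Once an entry is located, one simple recovery in the spirit of a roll-back is to recompute only the affected part of C rather than the whole product.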