Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. They normally do not interrupt the execution of the affected program, but the affected computation results can no longer be trusted. A well-known technique for correcting soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). Although ABFT is much more efficient than triple modular redundancy (TMR), a traditional general-purpose technique for correcting soft errors, both ABFT and TMR detect errors off-line, after the computation has finished. This paper extends traditional ABFT from off-line to on-line, so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation, while the program is executing, and higher efficiency can be achieved by correcting corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible (i.e., less than 1%) performance penalty over the ATLAS dgemm().
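The idea of checking checksums during the multiplication rather than after it can be illustrated with a small sketch. The code below is not the paper's implementation. It assumes the classic checksum encoding (a column-checksum row appended to A, a row-checksum column appended to B) and an outer-product formulation, under which every partial result is itself a full checksum matrix and can therefore be verified mid-computation. All names (`encode`, `find_mismatches`, `online_abft_matmul`, `check_every`, `inject`) are hypothetical, and the fixed tolerance is only suitable for small, well-scaled inputs.

```python
def encode(A, B):
    """Append a column-checksum row to A and a row-checksum column to B."""
    Ac = [row[:] for row in A] + [[sum(col) for col in zip(*A)]]
    Br = [row[:] + [sum(row)] for row in B]
    return Ac, Br


def find_mismatches(C, tol):
    """Return data rows and data columns whose sums disagree with their checksums."""
    m, n = len(C) - 1, len(C[0]) - 1
    bad_rows = [i for i in range(m) if abs(sum(C[i][:n]) - C[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(C[i][j] for i in range(m)) - C[m][j]) > tol]
    return bad_rows, bad_cols


def online_abft_matmul(A, B, check_every=1, inject=None, tol=1e-9):
    """Multiply A and B with checksum verification every `check_every` steps.

    `inject` is a hypothetical test hook (step, row, col, delta) that
    simulates a soft error by corrupting one element of the partial result.
    Returns the product (checksums stripped) and the number of corrections.
    """
    Ac, Br = encode(A, B)
    m1, n1, k = len(Ac), len(Br[0]), len(B)
    C = [[0.0] * n1 for _ in range(m1)]
    corrected = 0
    for s in range(k):
        # Rank-1 update: add the outer product of column s of Ac and row s of Br.
        for i in range(m1):
            a = Ac[i][s]
            for j in range(n1):
                C[i][j] += a * Br[s][j]
        if inject is not None and inject[0] == s:
            _, i, j, delta = inject
            C[i][j] += delta  # simulated soft error
        if (s + 1) % check_every == 0 or s == k - 1:
            rows, cols = find_mismatches(C, tol)
            if len(rows) == 1 and len(cols) == 1:
                # A single corrupted data element sits at the intersection;
                # the row-sum discrepancy equals the error magnitude.
                i, j = rows[0], cols[0]
                C[i][j] -= sum(C[i][:n1 - 1]) - C[i][n1 - 1]
                corrected += 1
            elif rows or cols:
                # Corrupted checksum entries or multiple errors are only
                # reported in this sketch, not repaired.
                raise RuntimeError("uncorrectable checksum mismatch")
    return [row[:n1 - 1] for row in C[:m1 - 1]], corrected


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
# Corrupt element (1, 0) of the partial result after the first update step.
product, fixes = online_abft_matmul(A, B, inject=(0, 1, 0, 100.0))
```

In this sketch an error is repaired at the next check after it occurs, so later update steps build on a clean partial result. Raising `check_every` lowers the checking overhead but also means that more than one error may accumulate between checks, which this single-error scheme cannot correct.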