We describe and test a software approach to overcoming radiation-induced errors in spaceborne applications running on commercial off-the-shelf components. The approach uses checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We can treat subroutines whose results satisfy a necessary condition of linear form; the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision numerical calculations. We test both the general effectiveness of the linear fault-tolerant schemes we propose and the correct behavior of our parallel implementation of them.
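As an illustration of the checksum idea, the following is a minimal sketch that assumes matrix multiplication as the protected subroutine. A correct product C = AB must satisfy the linear condition Cw = A(Bw) for any vector w, so the check compares the two sides and accepts only differences small enough to be explained by rounding. The function names, the choice of w, and the tolerance formula (with its `scale` safety factor) are assumptions made for this example and are not taken from the paper.

```python
import random
import sys

def matmul(a, b):
    """Plain product of two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]

def matvec(a, x):
    """Matrix-vector product."""
    return [sum(v * xj for v, xj in zip(row, x)) for row in a]

def norm_inf(a):
    """Infinity norm of a matrix: largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in a)

def checksum_ok(a, b, c, w, scale=8.0):
    """Check the necessary linear condition C w = A (B w) up to a tolerance.

    The tolerance is a hypothetical rounding-error allowance: it grows with
    the length of the summations and with the size of the data, so ordinary
    finite-precision error passes while larger discrepancies are flagged.
    """
    lhs = matvec(c, w)
    rhs = matvec(a, matvec(b, w))
    residual = max(abs(l - r) for l, r in zip(lhs, rhs))
    terms = len(b) + len(b[0])  # summation lengths involved in both sides
    eps = sys.float_info.epsilon
    w_max = max(abs(v) for v in w)
    tol = scale * terms * eps * norm_inf(a) * norm_inf(b) * w_max
    return residual <= tol

# Demonstration on random data, then with one corrupted entry of the result.
random.seed(0)
n = 4
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
w = [1.0] * n                 # all-ones weights give a plain row-sum checksum
c = matmul(a, b)
print("fault-free result passes:", checksum_ok(a, b, c, w))
c[1][2] += 1e-3               # simulated fault in a single element
print("corrupted result passes:", checksum_ok(a, b, c, w))
```

The check costs three matrix-vector products, which is small next to the product it guards. The tolerance is the delicate part: too tight and rounding error raises false alarms, too loose and small faults slip through. A condition of this kind is necessary but not sufficient, so a fault whose effect cancels in Cw would go unnoticed.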