An analysis of algorithm-based fault tolerance techniques
Journal of Parallel and Distributed Computing
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers
In high-performance systems, the probability of failure grows with system size. Calculation errors may occur that no other mechanism detects. To address this problem, we develop a checksum-based approach that detects and recovers from such errors, and we apply it to the LU factorization algorithm used by High Performance Linpack. Our approach has low overhead. In contrast to existing approaches, which require repeating the calculation, it repeats only a fraction of the calculation during recovery. The checking frequency can be adjusted to match the error rate, which makes the method a flexible form of fault tolerance.
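To make the checksum idea concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes LU factorization without pivoting (High Performance Linpack uses partial pivoting on a distributed matrix), and it swaps in the classic pair of plain and weighted row checksums from algorithm-based fault tolerance so that a single corrupted entry per row can be located and repaired. The function names `lu_with_checksums` and `check_and_correct` are hypothetical and chosen only for illustration.

```python
def lu_with_checksums(a):
    """LU factorization without pivoting (an assumption of this sketch).

    Each row carries two extra columns: its plain sum and a weighted sum
    with weights 1..n. Row updates are applied to these columns as well.
    Returns (L, U, c1, c2).
    """
    n = len(a)
    w = [row[:] + [sum(row), sum((j + 1) * x for j, x in enumerate(row))]
         for row in a]
    l = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            m = w[i][k] / w[k][k]
            l[i][k] = m
            w[i][k] = 0.0  # eliminated entry
            # Updating the checksum columns too keeps them equal to the
            # plain and weighted sums of the updated row.
            for j in range(k + 1, n + 2):
                w[i][j] -= m * w[k][j]
    u = [row[:n] for row in w]
    c1 = [row[n] for row in w]
    c2 = [row[n + 1] for row in w]
    return l, u, c1, c2


def check_and_correct(u, c1, c2, tol=1e-9):
    """Repair at most one corrupted entry per row of U, in place.

    Assumes the checksums themselves are intact. Returns the list of
    (row, column) positions that were repaired.
    """
    fixed = []
    for i, row in enumerate(u):
        d1 = sum(row) - c1[i]  # equals the error magnitude
        if abs(d1) <= tol:
            continue
        d2 = sum((j + 1) * x for j, x in enumerate(row)) - c2[i]
        j = round(d2 / d1) - 1  # weighted/plain ratio gives the column
        if not 0 <= j < len(row):
            raise ValueError(f"row {i}: error cannot be located")
        row[j] -= d1
        fixed.append((i, j))
    return fixed


if __name__ == "__main__":
    a = [[4.0, 3.0, 2.0], [8.0, 7.0, 9.0], [4.0, 6.0, 5.0]]
    l, u, c1, c2 = lu_with_checksums(a)
    u[1][2] += 2.5  # inject a calculation error into U
    print(check_and_correct(u, c1, c2))  # [(1, 2)]
    print(u[1])  # [0.0, 1.0, 5.0]
```

In this sketch, recovery touches only the affected row rather than repeating the factorization, and `check_and_correct` could be called every few elimination steps or only at the end, which mirrors the adjustable checking frequency described above. The tolerance `tol` is an assumed value that would need tuning for matrices where rounding error is significant.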