SRC: soft error detection and recovery for high performance linpack

Authors:
Teresa Davies; Zizhong Chen
Affiliations:
Colorado School of Mines, Golden, CO, USA; Colorado School of Mines, Golden, CO, USA
Venue:
Proceedings of the international conference on Supercomputing
Year:
2011

Abstract

In high-performance computing, the probability of failure increases with system size. Soft errors in calculations may go undetected by any other means. To address this problem, we develop a checksum-based approach that detects and recovers from calculation errors. We apply this approach to the LU factorization algorithm used by High Performance Linpack. Our approach has low overhead. In contrast to existing approaches, which require the calculation to be repeated, it repeats only a fraction of the calculation during recovery. The frequency of checking can be adjusted to match the error rate, resulting in a flexible method of fault tolerance.
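
The checksum idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It is a generic algorithm-based fault tolerance example that assumes a single row-sum checksum column appended to the matrix, Gaussian elimination with partial pivoting in pure Python, and a relative tolerance for floating-point comparison. The function name `eliminate_with_checksum`, the `check_every` parameter, and the `fault` hook are hypothetical names introduced for illustration. Recovery is not shown; the sketch only detects an error and reports the affected rows.

```python
def eliminate_with_checksum(a, check_every=1, tol=1e-9, fault=None):
    """Reduce A to upper-triangular U while carrying a checksum column.

    The matrix is augmented to [A | A*e], where e is the all-ones vector.
    Row swaps and row eliminations keep the invariant
    "last entry == sum of the other entries in that row".
    Returns (U, bad_rows); bad_rows is empty if no check failed.
    `fault` is a hypothetical hook fault(step, matrix) for injecting errors.
    """
    n = len(a)
    # Append the row-sum checksum to every row.
    m = [list(row) + [sum(row)] for row in a]
    for k in range(n):
        # Partial pivoting: swapping whole augmented rows keeps checksums valid.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        if m[k][k] == 0:
            raise ValueError("matrix is singular")
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            # Update the trailing entries and the checksum with the same factor.
            for j in range(k + 1, n + 1):
                m[i][j] -= f * m[k][j]
            m[i][k] = 0.0
        if fault is not None:
            fault(k, m)  # assumption: simulates a soft error after step k
        # Verify the invariant every `check_every` steps and at the end.
        if (k + 1) % check_every == 0 or k == n - 1:
            bad = [i for i in range(n)
                   if abs(sum(m[i][:n]) - m[i][n]) > tol * max(1.0, abs(m[i][n]))]
            if bad:
                return [row[:n] for row in m], bad
    return [row[:n] for row in m], []


A = [[4.0, 3.0, 2.0], [2.0, 1.0, 3.0], [3.0, 4.0, 1.0]]

# Fault-free run: no row fails its checksum.
U, bad_rows = eliminate_with_checksum(A)

# Inject an error into row 2 after the first elimination step.
def flip(step, matrix):
    if step == 0:
        matrix[2][1] += 1.0

_, bad_rows_faulty = eliminate_with_checksum(A, fault=flip)
```

A larger `check_every` lowers the checking cost at the price of detecting an error later, which mirrors the adjustable checking frequency described in the abstract.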