Algorithm-based recovery for HPL

Authors:
Teresa Davies;Zizhong Chen;Christer Karlsson;Hui Liu
Affiliations:
Colorado School of Mines, Golden, CO, USA;Colorado School of Mines, Golden, CO, USA;Colorado School of Mines, Golden, CO, USA;Colorado School of Mines, Golden, CO, USA
Venue:
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Year:
2011




Abstract

As more processors are used for a calculation, the probability that at least one of them will fail during the calculation increases. Fault tolerance allows a calculation to survive such a failure, which includes recovering the lost data. A common recovery method is diskless checkpointing, but its overhead is high when a large amount of data is involved, as is the case with matrix operations. A checksum-based method provides fault tolerance for matrix operations with lower overhead, and it is applicable to the LU decomposition in the HPL benchmark.
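
The checksum idea can be illustrated with a small sketch. It is not the paper's implementation: the block layout, the function names (`make_checksum`, `row_update`, `recover_block`), and the choice of a plain sum as the checksum are assumptions made for illustration only. The sketch shows that a linear row operation of the kind used in Gaussian elimination, applied identically to every data block and to the checksum block, keeps the checksum valid, so a single lost block can be rebuilt without a separate checkpoint.

```python
# Illustrative sketch only: names and layout are hypothetical, not taken from HPL.

def make_checksum(blocks):
    """Element-wise sum of equally sized blocks (each a list of rows)."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(b[i][j] for b in blocks) for j in range(cols)]
            for i in range(rows)]

def row_update(block, src, dst, factor):
    """Linear row operation, in place: row[dst] -= factor * row[src]."""
    block[dst] = [d - factor * s for d, s in zip(block[dst], block[src])]

def recover_block(survivors, checksum):
    """Rebuild the single missing block as checksum minus the surviving blocks."""
    rows, cols = len(checksum), len(checksum[0])
    return [[checksum[i][j] - sum(b[i][j] for b in survivors)
             for j in range(cols)]
            for i in range(rows)]

# Three hypothetical processes, each holding one 2x2 block of the matrix.
blocks = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
checksum = make_checksum(blocks)

# Apply the same row operation to every data block and to the checksum block.
# Because the operation is linear, the checksum relation still holds afterwards.
for b in blocks + [checksum]:
    row_update(b, src=0, dst=1, factor=2)

# Simulate losing the block held by process 1 and rebuild it from the rest.
lost = blocks[1]
recovered = recover_block([blocks[0], blocks[2]], checksum)
assert recovered == lost
```

The example uses integers so the recovery is exact. With floating-point data the recovered block would match only up to round-off error, and a single sum checksum of this kind tolerates the loss of one block at a time.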