Algorithm-Based Fault Tolerance for Fail-Stop Failures

Authors:
Zizhong Chen;Jack Dongarra
Affiliations:
University of Tennessee, Knoxville;University of Tennessee, Knoxville
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2008

Abstract

Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. Previous work on algorithm-based fault tolerance has proved that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation has remained an open question. We first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, for the outer product version of the algorithm, the checksum relationship can be maintained in the middle of the computation. Based on this property, we demonstrate that fail-stop process failures in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging.
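The checksum property described in the abstract can be illustrated with a small, single-process sketch. This is not the authors' ScaLAPACK implementation: the function names, the tiny integer matrices, and the entry-by-entry "inner product" ordering used for contrast are all assumptions chosen for illustration. The sketch appends a column-checksum row to A and a row-checksum column to B, accumulates the product as a sum of rank-1 (outer product) updates, and checks that the checksum relationship holds after every update. It then rebuilds one lost row from the checksum row, which is the kind of recovery a fail-stop failure would call for.

```python
def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row[:] + [sum(row)] for row in b]

def checksums_hold(c):
    """True if the last column/row of c equal the sums of the other columns/rows."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in c)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*c))
    return rows_ok and cols_ok

def outer_product_steps(ac, br):
    """Yield a copy of the partial result after each rank-1 update
    C += (column s of ac) * (row s of br)."""
    rows, cols = len(ac), len(br[0])
    c = [[0] * cols for _ in range(rows)]
    for s in range(len(br)):
        for i in range(rows):
            for j in range(cols):
                c[i][j] += ac[i][s] * br[s][j]
        yield [row[:] for row in c]

def inner_product_partial(ac, br, done):
    """Hypothetical contrast case: fill only the first `done` entries in
    row-major order, each as a complete dot product."""
    rows, cols = len(ac), len(br[0])
    c = [[0] * cols for _ in range(rows)]
    for idx in range(done):
        i, j = divmod(idx, cols)
        c[i][j] = sum(ac[i][s] * br[s][j] for s in range(len(br)))
    return c

def recover_row(c, lost):
    """Rebuild data row `lost` from the checksum row and the surviving data rows."""
    survivors = [r for i, r in enumerate(c[:-1]) if i != lost]
    return [c[-1][j] - sum(r[j] for r in survivors) for j in range(len(c[0]))]

# Small integer inputs (assumed for illustration) keep the arithmetic exact.
A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
Ac, Br = with_column_checksum(A), with_row_checksum(B)

partials = list(outer_product_steps(Ac, Br))
all_hold = all(checksums_hold(p) for p in partials)

# Simulate losing data row 0 after the first update, then rebuild it.
mid = partials[0]
rebuilt = recover_row(mid, 0)

# Entry-by-entry ordering: the relationship is broken mid-computation.
inner_mid_holds = checksums_hold(inner_product_partial(Ac, Br, 1))

print(all_hold, rebuilt == mid[0], inner_mid_holds)
```

Each rank-1 update is itself a full checksum matrix, so any running sum of them satisfies the relationship as well; the entry-by-entry ordering satisfies it only once every entry has been filled in. Integer data is used so that the checks can be exact; with floating-point data the comparisons would need a tolerance.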