Fault tolerant matrix operations using checksum and reverse computation

Authors:
Youngbae Kim;J. S. Plank;J. J. Dongarra
Venue:
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Year:
1996

Abstract

We present a technique, based on checksum and reverse computation, that makes high-performance matrix operations fault-tolerant with low overhead. We have implemented the technique for five matrix operations: matrix multiplication, Cholesky factorization, LU factorization, QR factorization, and Hessenberg reduction. We analyze the overhead of checkpointing and recovery both theoretically and experimentally. Both analyses confirm that the technique provides fault tolerance for these operations at low overhead.
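
The abstract names two ingredients, checksum and reverse computation, without showing them. The following is a minimal sketch of the general idea only, not the paper's implementation: all function names, block sizes, and data are hypothetical, and the rank-1 update merely stands in for one step of a factorization. It assumes a single lost block and uses integer data so that the reversal is exact; with floating-point data, reversing an update would reintroduce rounding error.

```python
# Illustrative sketch only: names and data are hypothetical, not from the paper.

def checksum(blocks):
    """Element-wise sum of equally sized blocks (each a list of rows)."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(b[i][j] for b in blocks) for j in range(cols)]
            for i in range(rows)]

def recover(blocks, csum, lost):
    """Rebuild blocks[lost] as the checksum minus all surviving blocks."""
    survivors = [b for k, b in enumerate(blocks) if k != lost]
    return [[csum[i][j] - sum(b[i][j] for b in survivors)
             for j in range(len(csum[0]))]
            for i in range(len(csum))]

def rank1_update(a, u, v, sign=1):
    """Return a + sign * (u v^T); sign=-1 reverses an earlier update."""
    return [[a[i][j] + sign * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]

# Three "processors", each holding one 2x2 block, plus one checksum block.
blocks = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]]
csum = checksum(blocks)

# Simulate losing processor 1, then rebuild its block from the checksum.
damaged = list(blocks)
damaged[1] = None
rebuilt = recover(damaged, csum, lost=1)

# Reverse computation: undo an update to return to the checksummed state.
u, v = [1, 2], [3, 4]
updated = rank1_update(blocks[0], u, v)
rolled_back = rank1_update(updated, u, v, sign=-1)

print(rebuilt)      # the lost block
print(rolled_back)  # block 0 as it was before the update
```

In this sketch the checksum costs one extra block of storage regardless of the number of data blocks, and rolling back by reversing recent updates avoids recomputing from the start.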