Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing

Authors:
Youngbae Kim; James S. Plank; Jack J. Dongarra
Venue:
Proceedings of High-Performance Computing on the Information Superhighway (HPC-Asia '97)
Year:
1997

Abstract

Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, fault tolerance is incorporated into the matrix operations, making them resilient to any single process failure with low overhead. In this paper, we present a technique called multiple checkpointing that enables the matrix operations to tolerate a certain set of multiple processor failures by adding multiple checkpointing processors. Results of implementing this technique on a network of workstations show improvement in both the reliability of the computation and the performance of checkpointing.
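
The idea of adding several checkpointing processors can be illustrated with a small sketch. The abstract does not state how the checkpoints are encoded, so this example assumes one simple arrangement: processors are split into fixed-size groups, and each group has its own checkpointing processor holding the XOR parity of the group's states. The names `encode_groups`, `recover`, and `group_size` are hypothetical and are not taken from the paper.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_groups(states, group_size):
    """Compute one parity block per group of processor states.

    Assumption: every state has the same length, and each group of
    `group_size` consecutive processors has its own checkpointing processor.
    """
    groups = [states[i:i + group_size]
              for i in range(0, len(states), group_size)]
    return [reduce(xor_bytes, group) for group in groups]


def recover(states, parities, group_size, failed):
    """Rebuild the state of processor `failed` from its group.

    `states` holds None for every failed processor. Recovery works only
    if `failed` is the sole failure within its group.
    """
    group = failed // group_size
    start = group * group_size
    survivors = [s for i, s in enumerate(states[start:start + group_size], start)
                 if i != failed]
    if any(s is None for s in survivors):
        raise ValueError("more than one failure in the same group")
    # XOR of the parity with all surviving states yields the lost state.
    return reduce(xor_bytes, survivors, parities[group])


# Six processors in two groups of three, with one failure in each group.
states = [bytes([i] * 4) for i in range(1, 7)]
parities = encode_groups(states, 3)
damaged = list(states)
damaged[1] = None
damaged[4] = None
restored = [recover(damaged, parities, 3, 1), recover(damaged, parities, 3, 4)]
```

Under this assumed layout, the computation survives several simultaneous failures as long as no two fall in the same group, and each group can compute its parity independently of the others. That is one plausible reading of the reliability and checkpointing-performance gains the abstract reports, not a description of the authors' implementation.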