Algorithm-Based Diskless Checkpointing for Fault-Tolerant Matrix Operations

Authors:
James S. Plank; Youngbae Kim; Jack J. Dongarra
Venue:
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Year:
1995


Abstract

This paper explores diskless checkpointing for distributed scientific computations. With the widespread use of the "Network of Workstations" (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and Preconditioned Conjugate Gradient. These implementations are able to run on PVM networks of at least N processors, and can complete with low overhead as long as any N processors remain functional. We discuss the details of how the algorithms are tuned for fault tolerance, and present performance results on a PVM network of Sun workstations and on the IBM SP2.
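
The abstract does not spell out the checkpointing mechanism, so the following is only a minimal sketch of the general idea behind diskless checkpointing, under stated assumptions: each of N application processors holds its state in memory, and one extra processor holds an encoding of all N states so that any single lost state can be rebuilt without touching disk. The sketch swaps in bytewise XOR parity as the encoding for simplicity; it should not be read as the authors' method. The function names `encode_checkpoint` and `recover_state` and the sample data are hypothetical.

```python
from functools import reduce
from operator import xor


def encode_checkpoint(states):
    """Return the bytewise XOR (parity) of equal-length processor states."""
    length = len(states[0])
    if any(len(s) != length for s in states):
        raise ValueError("all states must have the same length")
    # zip(*states) yields one tuple of byte values per byte position.
    return bytes(reduce(xor, column) for column in zip(*states))


def recover_state(surviving_states, parity):
    """Rebuild the single missing state from the survivors and the parity."""
    # XOR is its own inverse, so recovery reuses the encoding step.
    return encode_checkpoint(list(surviving_states) + [parity])


# Hypothetical in-memory state of N = 3 application processors.
states = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = encode_checkpoint(states)  # assumed to live on the extra processor

failed = 1  # pretend processor 1 is lost
survivors = [s for i, s in enumerate(states) if i != failed]
restored = recover_state(survivors, parity)
```

In this sketch a single parity block tolerates one lost state at a time; numerical codes would typically use a floating-point sum in place of XOR, at the cost of round-off in the recovered values.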