Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Information Processing Letters
An analysis of algorithm-based fault tolerance techniques
Journal of Parallel and Distributed Computing
ACM Transactions on Computer Systems (TOCS)
Failure correction techniques for large disk arrays
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Recovery in distributed systems using optimistic message logging and check-pointing
Journal of Algorithms
Supercomputing out of recycled garbage: preliminary experience with Piranha
ICS '92 Proceedings of the 6th international conference on Supercomputing
Virtual memory mapped network interface for the SHRIMP multicomputer
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
EVENODD: an optimal scheme for tolerating double disk failures in RAID architectures
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Supporting Fault-Tolerant Parallel Programming in Linda
IEEE Transactions on Parallel and Distributed Systems
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
LAPACK Users' guide (third ed.)
LAPACK Users' guide (third ed.)
Solving Linear Systems on Vector and Shared Memory Computers
Solving Linear Systems on Vector and Shared Memory Computers
ickp: A Consistent Checkpointer for Multicomputers
IEEE Parallel & Distributed Technology: Systems & Technology
Low-Latency, Concurrent Checkpointing for Parallel Programs
IEEE Transactions on Parallel and Distributed Systems
Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery
Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers
Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance
International Journal of Parallel Programming
The Journal of Supercomputing
A Parallel Adaptive Gauss-Jordan Algorithm
The Journal of Supercomputing
Fault-Tolerant Parallel Applications Using Queues and Actions
ICPP '97 Proceedings of the international Conference on Parallel Processing
System Checkpointing Using Reflection and Program Analysis
REFLECTION '01 Proceedings of the Third International Conference on Metalevel Architectures and Separation of Crosscutting Concerns
Fault tolerant matrix operations using checksum and reverse computation
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Compiler-assisted generation of error-detecting parallel programs
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Fault-Tolerant Parallel Applications with Dynamic Parallel Schedules
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 16 - Volume 17
Algorithm-based fault tolerance applied to high performance computing
Journal of Parallel and Distributed Computing
Evaluating the viability of process replication reliability for exascale systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Evaluating operating system vulnerability to memory errors
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
Convergence analysis of evolutionary algorithms in the presence of crash-faults and cheaters
Computers & Mathematics with Applications
Accelerating incremental checkpointing for extreme-scale computing
Future Generation Computer Systems
Hi-index | 0.01 |
This paper is an exploration of diskless checkpointing for distributed scientific computations. With the widespread use of the ``Network Of Workstation'' (NOW) platform for distributed computing, long-running scientific computations need to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and Preconditioned Conjugate Gradient. These implementations are able to run on PVM networks of at least N processors, and can complete with low overhead as long as any N processors remain functional. We discuss the details of how the algorithms are tuned for fault-tolerance, and present the performance results on a PVM network of SUN workstations, and on the IBM SP2.