In today's high-performance computing practice, fail-stop failures are typically tolerated by checkpointing. While checkpointing is a general technique applicable to a wide range of applications, it often introduces considerable overhead, especially as computations reach petascale and beyond. In this paper, we show that, for many iterative methods, if the parallel data partitioning scheme satisfies certain conditions, the iterative methods themselves maintain enough inherent redundant information to recover the lost data accurately without checkpointing. We analyze the block-row data partitioning scheme for sparse matrices and derive a sufficient condition for recovering the critical data without checkpointing. When this sufficient condition is satisfied, neither checkpointing nor rollback is necessary for recovery. Furthermore, the fault-tolerance time overhead is zero if no failure occurs during a program execution; overhead is incurred only when a failure actually does occur. Experimental results demonstrate that, when applicable, the proposed scheme introduces much less overhead than checkpointing on Kraken, currently the world's eighth-fastest supercomputer.
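The recovery idea can be sketched with a small dense example. This is a hedged illustration, not the paper's implementation: the matrix, block sizes, and the use of the relation r = A x in place of the method's actual residual recurrence are all assumptions made for brevity. The point is that when a process fails, the surviving processes' rows of r = A x still constrain the lost block of the iterate, which can therefore be reconstructed by solving a small least-squares problem whenever the corresponding off-diagonal column block has full column rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, nprocs = 12, 4              # toy sizes; the paper targets large sparse systems
blk = n // nprocs              # block-row size per process

# Dense stand-in for the iteration matrix (made diagonally dominant so it
# is well conditioned); a real run would use a distributed sparse matrix.
A = rng.random((n, n)) + n * np.eye(n)
x = rng.random(n)              # current iterate, distributed block-row-wise
r = A @ x                      # each process holds its own block of r = A x

# Process 1 fails: its blocks of x and r are lost.
lost = 1
lost_idx = np.arange(lost * blk, (lost + 1) * blk)
surv_idx = np.setdiff1d(np.arange(n), lost_idx)

# The surviving rows of r = A x give
#   r_surv = A[surv, surv] x_surv + A[surv, lost] x_lost,
# so x_lost is recoverable by least squares whenever A[surv, lost] has
# full column rank -- a dense analogue of the kind of rank condition the
# paper's sufficient condition places on the block-row partitioning.
rhs = r[surv_idx] - A[np.ix_(surv_idx, surv_idx)] @ x[surv_idx]
x_rec, *_ = np.linalg.lstsq(A[np.ix_(surv_idx, lost_idx)], rhs, rcond=None)

recovered_ok = bool(np.allclose(x_rec, x[lost_idx]))
```

Note that no checkpoint of x is ever taken: the redundancy lives entirely in the data the surviving processes already hold, which is why the overhead is zero in failure-free runs.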