Bounds on Algorithm-Based Fault Tolerance in Multiple Processor Systems
IEEE Transactions on Computers - The MIT Press scientific computation series
An analysis of algorithm-based fault tolerance techniques
Journal of Parallel and Distributed Computing
A Linear Algebraic Model of Algorithm-Based Fault Tolerance
IEEE Transactions on Computers
Algorithm-Based Fault Tolerance on a Hypercube Multiprocessor
IEEE Transactions on Computers
Fault-tolerant matrix operations for networks of workstations using diskless checkpointing
Journal of Parallel and Distributed Computing
IEEE Transactions on Parallel and Distributed Systems
A first order approximation to the optimum checkpoint interval
Communications of the ACM
Fault-Tolerant High-Performance Matrix Multiplication: Theory and Practice
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Evaluation of checkpoint mechanisms for massively parallel machines
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Fault-tolerant matrix operations for parallel and distributed systems
Fault-tolerant matrix operations for parallel and distributed systems
An Experimental Study about Diskless Checkpointing
EUROMICRO '98 Proceedings of the 24th Conference on EUROMICRO - Volume 1
Fault tolerant high performance computing by a coding approach
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Scalable diskless checkpointing for large parallel systems
Scalable diskless checkpointing for large parallel systems
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers
Algorithm-Based Fault Tolerance for Fail-Stop Failures
IEEE Transactions on Parallel and Distributed Systems
International Journal of High Performance Computing Applications
Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing
IEEE Transactions on Computers
Optimal real number codes for fault tolerant matrix operations
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems
Distributed Diskless Checkpoint for Large Scale Systems
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Algorithm-based fault tolerance for dense matrix factorizations
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
An evaluation of user-level failure mitigation support in MPI
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Correcting soft errors online in LU factorization
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Parallel reduction to hessenberg form with algorithm-based fault tolerance
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For long running applications using a large number of processors, it is essential that fault tolerance be used to prevent a total loss of all finished computations after a failure. While checkpointing has been very useful to tolerate failures for a long time, it often introduces a considerable overhead especially when applications modify a large amount of memory between checkpoints and the number of processors is large. In this paper, we propose an algorithm-based recovery scheme for the High Performance Linpack benchmark (which modifies a large amount of memory in each iteration) to tolerate fail-stop failures without checkpointing. It was proved by Huang and Abraham that a checksum added to a matrix will be maintained after the matrix is factored. We demonstrate that, for the right-looking LU factorization algorithm, the checksum is maintained at each step of the computation. Based on this checksum relationship maintained at each step in the middle of the computation, we demonstrate that fail-stop process failures in High Performance Linpack can be tolerated without checkpointing. Because no periodical checkpoint is necessary during computation and no roll-back is necessary during recovery, the proposed recovery scheme is highly scalable and has a good potential to scale to extreme scale computing and beyond. Experimental results on the supercomputer Jaguar demonstrate that the fault tolerance overhead introduced by the proposed recovery scheme is negligible.