Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing

Authors:
Zizhong Chen;Jack Dongarra
Affiliations:
Colorado School of Mines, Golden;University of Tennessee, Knoxville
Venue:
IEEE Transactions on Computers
Year:
2009

Citing 0
Cited 6

Optimal real number codes for fault tolerant matrix operations

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
High performance linpack benchmark: a fault tolerant implementation without checkpointing

Proceedings of the international conference on Supercomputing
Algorithm-based recovery for iterative methods without checkpointing

Proceedings of the 20th international symposium on High performance distributed computing
Scalable distributed consensus to support MPI fault tolerance

EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Fault tolerant preconditioned conjugate gradient for sparse linear system solution

Proceedings of the 26th ACM international conference on Supercomputing
Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

Quantified Score

Hi-index	15.00

Visualization

Abstract

As the number of processors in today's high-performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high-performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high-performance computing applications cannot survive node failures. Therefore, whenever a node fails, all surviving processes on surviving nodes usually have to be aborted and the whole application has to be restarted. In this paper, we present a framework for building self-healing high-performance numerical computing applications so that they can adapt to node or link failures without aborting themselves. The framework is based on FT-MPI and diskless checkpointing. Our diskless checkpointing uses weighted checksum schemes, a variation of Reed-Solomon erasure codes over floating-point numbers. We introduce several scalable encoding strategies into the existing diskless checkpointing and reduce the overhead to survive k failures in p processes from 2 \lceil \log p \rceil . k ((\beta + 2\gamma ) m + \alpha ) to (1 + O ({\sqrt{p}\over \sqrt{m}} ) )^2 . k (\beta + 2\gamma ) m, where \alpha is the communication latency, {1\over \beta } is the network bandwidth between processes, {1\over \gamma } is the rate to perform calculations, and m is the size of local checkpoint per process. When additional checkpoint processors are used, the overhead can be reduced to (1 + O ({1\over \sqrt{m}} ) ) . k (\beta + 2\gamma ) m, which is independent of the total number of computational processors. The introduced self-healing algorithms are scalable in the sense that the overhead to survive k failures in p processes does not increase as the number of processes p increases. We evaluate the performance overhead of our self-healing approach by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that our self-healing scheme can survive multiple simultaneous process failures with low-performance overhead and little numerical impact.