Building Single Fault Survivable Parallel Algorithms for Matrix Operations Using Redundant Parallel Computation

Authors:
Yunfei Du;Panfeng Wang;Hongyi Fu;Jia Jia;Haifang Zhou;Xuejun Yang
Affiliations:
National University of Defense Technology, Changsha, Hunan, 410073, China (all authors)
Venue:
CIT '07 Proceedings of the 7th IEEE International Conference on Computer and Information Technology
Year:
2007

Abstract

As the size of today's high performance computers continues to grow, node failures in these computers are becoming frequent events. Although checkpointing is the typical technique for tolerating such failures, it often introduces considerable overhead and has shown poor scalability on today's large scale systems. In this paper we define a new term, fault tolerant parallel algorithm, meaning an algorithm that obtains the correct answer despite the failure of nodes. The fault tolerance approach is checkpoint-free: applications are modified so that the data of a failed process is recovered by recomputation on all surviving processes. In particular, if no failure occurs, the fault tolerant parallel algorithms are the same as the original algorithms. We show the practicality of this technique by applying it to parallel dense matrix-matrix multiplication and Gaussian elimination to tolerate a single process failure. Experimental results demonstrate that a process failure can be tolerated with good scalability for the two fault tolerant parallel algorithms, and that the proposed fault tolerant parallel dense matrix-matrix multiplication is able to survive a process failure with very low performance overhead. The main drawback of this approach is that it is non-transparent and algorithm-dependent.
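The recovery idea in the abstract can be illustrated with a small single-machine simulation. This is only a sketch under stated assumptions, not the authors' algorithm: it assumes a row-block distribution of C = A·B, assumes the inputs A and B remain readable by the surviving processes after the failure, and uses hypothetical helper names (`split`, `matmul_rows`, `fault_tolerant_matmul`). "Processes" are simulated as dictionary entries rather than real MPI ranks.

```python
def matmul_rows(a_rows, b):
    """Multiply a block of rows of A by the full matrix B."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a_rows]


def split(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) slices."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def fault_tolerant_matmul(a, b, nprocs, failed=None):
    """Simulate C = A*B over `nprocs` row-block owners.

    If `failed` names a process, its result block is discarded and then
    recomputed piecewise by the survivors (assumption: A and B are still
    available to them). With failed=None this is the plain algorithm.
    """
    owned = split(len(a), nprocs)
    # Normal phase: every process computes its own row block of C.
    blocks = {p: matmul_rows(a[lo:hi], b) for p, (lo, hi) in enumerate(owned)}

    if failed is not None:
        del blocks[failed]                      # the failed process's data is lost
        lo, hi = owned[failed]
        survivors = [p for p in range(nprocs) if p != failed]
        recovered = []
        # Recovery phase: the lost rows are divided among all survivors,
        # each of which recomputes its share.
        for s_lo, s_hi in split(hi - lo, len(survivors)):
            recovered.extend(matmul_rows(a[lo + s_lo:lo + s_hi], b))
        blocks[failed] = recovered

    return [row for p in range(nprocs) for row in blocks[p]]


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6], [7, 8]]
    B = [[1, 2], [3, 4]]
    print(fault_tolerant_matmul(A, B, nprocs=3, failed=1))
```

In this sketch the failure-free path does no extra work, mirroring the abstract's statement that the fault tolerant algorithm matches the original when no failure occurs. The recovery cost is spread across all survivors instead of being paid by a checkpoint restore.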