The Fault Tolerant Parallel Algorithm: the Parallel Recomputing Based Failure Recovery

Authors:
Xuejun Yang;Yunfei Du;Panfeng Wang;Hongyi Fu;Jia Jia;Zhiyuan Wang;Guang Suo
Affiliations:
National University of Defense Technology, China;National University of Defense Technology, China;National University of Defense Technology, China;National University of Defense Technology, China;National University of Defense Technology, China;National University of Defense Technology, China;National University of Defense Technology, China
Venue:
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
Year:
2007

Abstract

This paper addresses fault tolerance in parallel computing and proposes a new method named parallel recomputing. The method recovers from failures automatically by having the surviving processes recompute the workload of the failed processes in parallel. The paper first defines a fault tolerant parallel algorithm (FTPA) as a parallel algorithm that tolerates failures through parallel recomputing. It then proposes an inter-process definition-use relationship analysis method, based on conventional definition-use analysis, that reveals the relationships among variables in different processes. Guided by this analysis, the paper gives principles for designing fault tolerant parallel algorithms. Finally, the authors present FTPA designs for matrix-matrix multiplication and NPB kernels and evaluate them experimentally on a cluster system. The experimental results show that the overhead of FTPA is lower than that of checkpointing.
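
The idea of parallel recomputing can be illustrated with a small sketch. The Python code below is not the authors' implementation: it simulates the processes sequentially in a single program instead of using MPI, assumes a row-block distribution for matrix-matrix multiplication, and assumes exactly one failed process. All function names (`matmul_rows`, `block_partition`, `run_with_recovery`) are hypothetical and chosen only for illustration.

```python
def matmul_rows(A, B, rows):
    """Compute the given rows of A @ B (the workload of one process)."""
    inner, cols = len(B), len(B[0])
    return {
        i: [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in rows
    }


def block_partition(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    items = list(items)
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def run_with_recovery(A, B, num_procs, failed_rank):
    """Simulate one failure; survivors recompute the lost rows (assumed scheme)."""
    owned = block_partition(range(len(A)), num_procs)
    results = {}

    # Normal phase: each surviving rank computes its own row block.
    for rank in range(num_procs):
        if rank == failed_rank:
            continue  # the failed rank's partial results are lost
        results.update(matmul_rows(A, B, owned[rank]))

    # Recovery phase: the failed rank's rows are split across all survivors,
    # so the lost workload is recomputed in parallel rather than by one process.
    survivors = [r for r in range(num_procs) if r != failed_rank]
    shares = block_partition(owned[failed_rank], len(survivors))
    recomputed_by = {}
    for rank, share in zip(survivors, shares):
        results.update(matmul_rows(A, B, share))
        recomputed_by[rank] = share

    C = [results[i] for i in range(len(A))]
    return C, recomputed_by


if __name__ == "__main__":
    A = [[i + j for j in range(3)] for i in range(8)]
    B = [[1, 2], [3, 4], [5, 6]]
    C, recomputed_by = run_with_recovery(A, B, num_procs=4, failed_rank=1)
    print(recomputed_by)
```

In this sketch the lost rows are divided among the survivors, so the recovery work per process shrinks as the number of surviving processes grows. A real FTPA would also have to restore whatever data the failed process defined and other processes use, which is the role the abstract assigns to the inter-process definition-use analysis.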