A novel fault-tolerant parallel algorithm

Authors:
Panfeng Wang;Yunfei Du;Hongyi Fu;Haifang Zhou;Xuejun Yang;Wenjing Yang
Affiliations:
National Laboratory for Paralleling and Distributed Processing, College of Computer, National University of Defense Technology, Changsha, Hunan, China (all authors)
Venue:
APPT'07 Proceedings of the 7th international conference on Advanced parallel processing technologies
Year:
2007

Abstract

The mean time between failures (MTBF) of current high-performance computer systems is much shorter than the running time of many computational applications, yet those applications are the main workload of such systems. Checkpoint/restart is currently the most commonly used scheme for tolerating hardware failures in these applications, but its performance is limited when the number of processors becomes much larger. In this paper, we propose FPAPR, a novel fault-tolerant parallel algorithm. First, we introduce the basic idea of FPAPR. Second, we describe how to implement an FPAPR program, using two NPB kernels as examples. Third, we analyze the overhead of FPAPR theoretically and find that it decreases as the number of processors increases. Finally, experimental results on a 512-CPU cluster show that the overhead introduced by the algorithm is very small.
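
The scaling limitation of conventional checkpoint/restart mentioned in the abstract can be illustrated with a rough first-order model. The sketch below is not taken from the paper and does not model FPAPR. It assumes independent node failures (so system MTBF is node MTBF divided by the node count), a fixed checkpoint cost, and Young's approximation for the checkpoint interval. All function names and numeric values are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """System MTBF, assuming independent node failures at a constant rate."""
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the best checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def checkpoint_overhead(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Approximate overhead fraction of checkpoint/restart.

    First term: time spent writing checkpoints per unit of work.
    Second term: expected rework after a failure (half an interval on average).
    Only meaningful when the checkpoint cost is much smaller than the MTBF.
    """
    tau = young_interval(checkpoint_cost_hours, mtbf_hours)
    return checkpoint_cost_hours / tau + tau / (2.0 * mtbf_hours)


if __name__ == "__main__":
    NODE_MTBF_HOURS = 50_000.0      # hypothetical per-node MTBF
    CHECKPOINT_COST_HOURS = 0.1     # hypothetical cost of one checkpoint
    for nodes in (64, 512, 4096):
        mtbf = system_mtbf(NODE_MTBF_HOURS, nodes)
        overhead = checkpoint_overhead(CHECKPOINT_COST_HOURS, mtbf)
        print(f"{nodes:5d} nodes: system MTBF {mtbf:8.1f} h, overhead {overhead:.3%}")
```

Under these assumptions the overhead fraction grows roughly with the square root of the node count, which is the trend the abstract contrasts with FPAPR's decreasing overhead.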