Parallel checkpoint/recovery on cluster of IA-64 computers

  • Authors:
  • Youhui Zhang;Dongsheng Wang;Weimin Zheng

  • Affiliations:
  • Department of Computer Science, Tsinghua Univ., Beijing, P.R.C;Department of Computer Science, Tsinghua Univ., Beijing, P.R.C;Department of Computer Science, Tsinghua Univ., Beijing, P.R.C

  • Venue:
  • ISPA'04 Proceedings of the Second international conference on Parallel and Distributed Processing and Applications
  • Year:
  • 2004

Abstract

We design and implement ChaRM64, a high-availability parallel run-time system: a Checkpoint-based Rollback Recovery and Migration system for parallel programs running on a cluster of IA-64 computers. We first discuss our user-level, single-process checkpoint/recovery library for IA-64 systems. ChaRM64 is built on this library and implements user-transparent, coordinated checkpointing and rollback recovery (CRR), quasi-asynchronous migration, and dynamic reconfiguration. With these techniques and efficient error detection, ChaRM64 can handle cluster node crashes and transient hardware faults in an IA-64 cluster. ChaRM64 for PVM has been implemented on Linux, and an MPI version is under construction. To our knowledge, few similar projects have been completed for the IA-64 architecture.
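
The idea of coordinated checkpointing and rollback recovery named in the abstract can be illustrated with a toy, single-machine sketch. This is not ChaRM64's implementation and does not use PVM, MPI, or any IA-64 specifics: the `Worker` class, its dictionary state, and the `coordinated_checkpoint` and `recover` functions are hypothetical names introduced only for illustration. The sketch assumes a simple blocking two-phase scheme in which every process first saves a tentative checkpoint and commits it only after all processes have saved theirs.

```python
import copy


class Worker:
    """Toy stand-in for one parallel process (hypothetical, not ChaRM64 code)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0, "total": 0}
        self.tentative = None                        # saved but not yet committed
        self.committed = copy.deepcopy(self.state)   # last consistent checkpoint

    def compute(self):
        # One unit of application work.
        self.state["step"] += 1
        self.state["total"] += self.rank * self.state["step"]

    def take_tentative(self):
        self.tentative = copy.deepcopy(self.state)

    def commit(self):
        self.committed = self.tentative
        self.tentative = None

    def rollback(self):
        self.state = copy.deepcopy(self.committed)
        self.tentative = None


def coordinated_checkpoint(workers):
    # Phase 1: all workers pause and save a tentative checkpoint.
    for w in workers:
        w.take_tentative()
    # Phase 2: only once every worker has saved, all of them commit,
    # so the committed checkpoints form one consistent global state.
    for w in workers:
        w.commit()


def recover(workers):
    # After a fault, every worker returns to the same committed checkpoint.
    for w in workers:
        w.rollback()


workers = [Worker(rank) for rank in range(3)]
for _ in range(2):
    for w in workers:
        w.compute()
coordinated_checkpoint(workers)

for w in workers:
    w.compute()          # work done after the checkpoint

recover(workers)         # simulated fault: post-checkpoint work is discarded
steps = [w.state["step"] for w in workers]
totals = [w.state["total"] for w in workers]
print(steps, totals)
```

Because all workers roll back to checkpoints taken in the same round, no worker resumes from a state that is ahead of or behind the others. A real system must also deal with messages in flight at checkpoint time, which this sketch leaves out.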