User-level checkpoint and recovery for LAM/MPI

  • Authors:
  • Youhui Zhang;Dongsheng Wong;Weimin Zheng

  • Affiliations:
  • Tsinghua University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China

  • Venue:
  • ACM SIGOPS Operating Systems Review
  • Year:
  • 2005

Abstract

As high-performance clusters continue to grow in size and popularity, fault tolerance and reliability are becoming limiting factors on application scalability. We integrated a user-level checkpointing and rollback recovery (CRR) library into LAM/MPI, a high-performance implementation of the Message Passing Interface (MPI), to improve its availability. Compared with the existing CRR implementation in LAM/MPI, our work supports file checkpointing and is more portable: it runs on more platforms, including IA32 and IA64 Linux. In addition, our tests show that the CRR mechanism in our implementation introduces less than 15% performance overhead.