In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes

  • Authors:
  • Gang Wang; Xiaoguang Liu; Ang Li; Fan Zhang

  • Affiliations:
  • Nankai-Baidu Joint Lab, College of Information Technical Science, Nankai University, Tianjin, China 300071 (all four authors)

  • Venue:
  • Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • Year:
  • 2009

Abstract

The scale of high-performance computing (HPC) systems is larger than ever, which makes fault tolerance a growing challenge. MPI (Message Passing Interface) is one of the most important programming tools for HPC, and several fault-tolerant extensions exist for it, such as MPICH-V, StarFish, and FT-MPI. Most of them are based on on-disk checkpointing. In this paper, we apply two kinds of XOR-based double-erasure codes, RDP (Row-Diagonal Parity) and B-Code, to in-memory checkpointing for MPI programs. We develop scalable checkpointing and recovery algorithms that embed the erasure-code encoding and decoding computation into MPI collective communication operations. Experiments show that the scalable algorithms decrease communication overhead and balance computation effectively. Our approach provides highly reliable, fast in-memory checkpointing for MPI programs.
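
To make the idea of an XOR-based double-erasure code concrete, the following is a minimal single-process sketch of RDP encoding and recovery. It is not the authors' algorithm: it does not use MPI, and the names `rdp_encode` and `rdp_recover` are hypothetical. It assumes each column stands for the checkpoint held by one process, split into p-1 integer blocks for a prime p, with the row-parity and diagonal-parity columns held by two extra processes. In an MPI setting, each XOR accumulation below would plausibly correspond to a reduction with the predefined `MPI_BXOR` operation.

```python
from functools import reduce
from operator import xor


def rdp_encode(data, p):
    """Encode p-1 rows by p-1 data columns (p prime) of integer blocks.

    Returns (grid, diag): grid has p columns (the data columns plus the
    row-parity column), and diag holds the p-1 stored diagonal parities.
    """
    grid = [row[:] + [reduce(xor, row, 0)] for row in data]
    diag = [0] * (p - 1)
    for r in range(p - 1):
        for c in range(p):
            d = (r + c) % p
            if d != p - 1:  # diagonal p-1 is the one RDP does not store
                diag[d] ^= grid[r][c]
    return grid, diag


def rdp_recover(grid, diag, lost_cols, p):
    """Rebuild up to two lost grid columns in place by repeated peeling."""
    rows = p - 1
    unknown = {(r, c) for r in range(rows) for c in lost_cols}
    # Each equation is (cells, constant): the XOR of the cells equals constant.
    equations = [([(r, c) for c in range(p)], 0) for r in range(rows)]
    for d in range(p - 1):
        equations.append(([(r, (d - r) % p) for r in range(rows)], diag[d]))
    while unknown:
        progress = False
        for cells, const in equations:
            missing = [cell for cell in cells if cell in unknown]
            if len(missing) != 1:
                continue
            # Exactly one unknown block: it equals the XOR of everything else.
            value = const
            for cell in cells:
                if cell != missing[0]:
                    value ^= grid[cell[0]][cell[1]]
            grid[missing[0][0]][missing[0][1]] = value
            unknown.discard(missing[0])
            progress = True
        if not progress:
            raise ValueError("erasure pattern cannot be recovered")
    return grid


# Illustrative data only: 4 "processes", each holding 4 checkpoint blocks.
p = 5
original = [[(7 * r + 3 * c + 1) % 251 for c in range(p - 1)]
            for r in range(p - 1)]
grid, diag = rdp_encode(original, p)

# Simulate the simultaneous loss of the checkpoints in columns 1 and 3.
damaged = [row[:] for row in grid]
for row in damaged:
    row[1] = row[3] = None
recovered = rdp_recover(damaged, diag, [1, 3], p)
print(recovered == grid)
```

The recovery loop never reads a lost block: it only solves a parity equation once exactly one of its blocks is unknown, which is why two lost columns can be rebuilt from the surviving columns and the two parity columns.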