TH-MPI: OS Kernel Integrated Fault Tolerant MPI

  • Authors:
  • Yu Chen;Qian Fang;Zhihui Du;Sanli Li

  • Venue:
  • Proceedings of the 8th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • Year:
  • 2001

Abstract

Because they consist of large numbers of computing nodes, parallel cluster systems face a high risk of individual node failure. To overcome the high overhead of current fault-tolerant MPI systems, this paper presents TH-MPI for parallel cluster systems. Because it is integrated into the Linux kernel, TH-MPI is implemented in a more effective, transparent, and extensive way. Our experiments show that, with support from dynamic kernel module and diskless checkpointing technologies, checkpointing in TH-MPI is effectively optimized.
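
The abstract does not describe how TH-MPI performs diskless checkpointing, so the following is only a generic sketch of the idea, not the paper's method. It assumes a common XOR-parity scheme: each node keeps its checkpoint in memory, a spare node holds the parity of all checkpoints, and the state of a single failed node is rebuilt from the survivors and the parity. The names `make_parity`, `recover`, and `node_states` are hypothetical and chosen for illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: list[bytes]) -> bytes:
    """Parity block that a spare node would keep in memory (assumed scheme)."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory state of three compute nodes (equal-sized buffers).
node_states = [b"node0-st", b"node1-st", b"node2-st"]
parity = make_parity(node_states)

# Simulate the failure of node 1 and rebuild its state without touching disk.
lost = 1
survivors = [s for i, s in enumerate(node_states) if i != lost]
rebuilt = recover(survivors, parity)
```

A single parity block tolerates only one node failure between checkpoints; tolerating more would need a stronger code, which is outside what the abstract states.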