New user-guided and ckpt-based checkpointing libraries for parallel MPI applications

  • Authors:
  • Paweł Czarnul;Marcin Frączak

  • Affiliations:
  • Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland;Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Poland

  • Venue:
  • PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • Year:
  • 2005

Abstract

We present design and implementation details, as well as performance results, for two new checkpointing libraries that we developed for parallel MPI applications. The first, a user-guided library, requires the programmer to supply packing and unpacking code through an easy-to-use API that uses MPI constants. It performs checkpointing either with MPI-2 collective I/O calls or through a dedicated master process. The second is a technically advanced parallel implementation of checkpointing based on the user-level ckpt library. It wraps the MPI calls in the user program, which makes it possible to run a shadow MPI application solely for communication purposes. The original processes and the shadow MPI code communicate via shared memory segments to which the communication buffers are mapped. We compare checkpoint/restart times for the two approaches, and for the variants of each that we propose, against an available LAM/MPI/BLCR checkpointing solution for MPI applications. We discuss the performance of all versions, along with I/O optimizations, on a 4-node, 16-processor cluster with NFS and, specifically, on single SMP nodes with a local file system.
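
To make the user-guided idea concrete, the following is a minimal sketch, not the library's actual API, which the abstract does not show. It assumes the programmer lists the variables to save together with a type tag, and the tags `CKPT_INT` and `CKPT_DOUBLE` are hypothetical stand-ins for MPI datatype constants such as `MPI_INT` and `MPI_DOUBLE`. For simplicity each rank writes its own file, whereas the paper uses MPI-2 collective I/O or a dedicated master process. All function names here are illustrative.

```python
import os
import struct

# Hypothetical type tags standing in for MPI datatype constants
# (e.g. MPI_INT, MPI_DOUBLE); they double as struct format codes.
CKPT_INT = "i"
CKPT_DOUBLE = "d"

HEADER = "<cI"  # 1-byte type tag followed by a 4-byte element count


def pack_state(fields):
    """Pack a list of (type_tag, values) pairs into one bytes buffer."""
    chunks = []
    for tag, values in fields:
        chunks.append(struct.pack(HEADER, tag.encode("ascii"), len(values)))
        chunks.append(struct.pack("<%d%s" % (len(values), tag), *values))
    return b"".join(chunks)


def unpack_state(buf):
    """Inverse of pack_state: rebuild the list of (type_tag, values) pairs."""
    fields = []
    offset = 0
    while offset < len(buf):
        tag, count = struct.unpack_from(HEADER, buf, offset)
        offset += struct.calcsize(HEADER)
        fmt = "<%d%s" % (count, tag.decode("ascii"))
        values = list(struct.unpack_from(fmt, buf, offset))
        offset += struct.calcsize(fmt)
        fields.append((tag.decode("ascii"), values))
    return fields


def write_checkpoint(directory, rank, fields):
    """Write one rank's packed state to its own file (a simplification)."""
    path = os.path.join(directory, "ckpt_rank%d.bin" % rank)
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(pack_state(fields))
    # Rename last so an interrupted write does not replace a good checkpoint.
    os.replace(tmp, path)
    return path


def read_checkpoint(path):
    """Read a checkpoint file and return the unpacked state."""
    with open(path, "rb") as f:
        return unpack_state(f.read())
```

On restart, a process would call `read_checkpoint` with its own rank's file and assign the returned values back to its variables. That assignment is the unpacking code the programmer is expected to provide in the user-guided approach.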