Process Migration for MPI Applications based on Coordinated Checkpoint

  • Authors:
  • Jiannong Cao;Yinghao Li;Minyi Guo

  • Affiliations:
  • Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hung Hom, Hong Kong, China PR;Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hung Hom, Hong Kong, China PR;Department of Computer Software, The University of Aizu, Aizu-Wakamatsu City

  • Venue:
  • ICPADS '05 Proceedings of the 11th International Conference on Parallel and Distributed Systems - Volume 01
  • Year:
  • 2005

Abstract

Much research has addressed fault tolerance for MPI applications, some of it on checkpoint/restart and some on network fault tolerance. Process migration, however, has not gained widespread use because of one added complexity: the new location of a migrated process must be made known to every other process in the application. Here we present a simple yet effective method of process migration based on coordinated checkpointing of MPI applications. Migration is achieved by checkpointing the application, modifying the process location information in the checkpoint files, and restarting the application. Checkpoint/restart and migration are transparent to MPI applications. Performance evaluation results show that the additional checkpoint/restart capability has little impact on application performance, and that the migration method scales well to a large number of nodes.
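
The three migration steps named in the abstract (checkpoint, rewrite location information, restart) can be illustrated with a toy model. This is only a sketch under stated assumptions: the abstract does not describe the checkpoint file format or any API, so the `Checkpoint` record, the `locations` table, and the functions `coordinated_checkpoint`, `migrate`, and `restart` are all hypothetical names, and in-memory objects stand in for real checkpoint files and MPI processes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Checkpoint:
    """Toy stand-in for one process's checkpoint file (hypothetical layout)."""
    rank: int
    state: dict       # saved application state for this rank
    locations: dict   # rank -> host name, as known to this process


def coordinated_checkpoint(states, locations):
    """Snapshot every rank at the same point, each with a copy of the location table."""
    return [Checkpoint(r, dict(s), dict(locations)) for r, s in sorted(states.items())]


def migrate(checkpoints, rank, new_host):
    """Rewrite the location of `rank` in every checkpoint; application state is untouched."""
    return [replace(c, locations={**c.locations, rank: new_host}) for c in checkpoints]


def restart(checkpoints):
    """Rebuild per-rank state and the location table from the checkpoints."""
    states = {c.rank: dict(c.state) for c in checkpoints}
    locations = dict(checkpoints[0].locations)
    return states, locations


# Hypothetical three-process application, all ranks at the same step.
states = {0: {"step": 7}, 1: {"step": 7}, 2: {"step": 7}}
locations = {0: "nodeA", 1: "nodeB", 2: "nodeC"}

ckpts = coordinated_checkpoint(states, locations)
new_states, new_locations = restart(migrate(ckpts, rank=1, new_host="nodeD"))
```

In this model every process learns the new location simply because each restarted process reads the already-updated table from its own checkpoint, which is the property the abstract relies on to avoid a separate notification protocol.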