Migol: a fault-tolerant service framework for MPI applications in the grid

  • Authors:
  • André Luckow; Bettina Schnor

  • Affiliations:
  • Institute of Computer Science, University Potsdam, Germany; Institute of Computer Science, University Potsdam, Germany

  • Venue:
  • PVM/MPI'05: Proceedings of the 12th European PVM/MPI Users' Group Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • Year:
  • 2005

Abstract

In a distributed, inherently dynamic Grid environment, the reliability of individual resources cannot be guaranteed. The more resources and components are involved, the more error-prone the system becomes. It is therefore important to enhance the dependability of the system with fault-tolerance mechanisms. In this paper, we present Migol, a fault-tolerant, self-healing Grid service infrastructure for MPI applications. The benefit of the Grid is that, in case of a failure, an application can be migrated to another site and restarted there from a checkpoint file. This approach requires a service infrastructure that handles the necessary activities transparently for the application. However, a migration framework cannot support fault-tolerant applications unless it is fault-tolerant itself.