Migol: A fault-tolerant service framework for MPI applications in the grid

  • Authors:
  • André Luckow;Bettina Schnor

  • Affiliations:
  • Institute of Computer Science, University of Potsdam, Germany;Institute of Computer Science, University of Potsdam, Germany

  • Venue:
  • Future Generation Computer Systems
  • Year:
  • 2008

Abstract

For the sciences in particular, the provision of massively parallel CPU capacity is one of the most attractive features of a grid. A major challenge in a distributed, inherently dynamic grid is fault tolerance. The more resources and components are involved, the more complicated and error-prone the system becomes. In a grid with potentially thousands of interconnected machines, the reliability of individual resources cannot be guaranteed. The benefit of the grid is that, in case of a failure, an application can be migrated to another site and restarted there from a checkpoint file. This approach requires a service infrastructure that handles the necessary activities transparently. In this article, we present Migol, a fault-tolerant and self-healing grid middleware for MPI applications. Migol is based on open standards and extends the services of the Globus Toolkit to support the fault tolerance of grid applications. Further, the Migol framework itself is designed with a special focus on fault tolerance. For example, Migol replicates critical services and uses a ring-based replication protocol to achieve data consistency.
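
The abstract does not describe the ring-based replication protocol in detail, so the following is only a minimal sketch of the general idea, not Migol's actual implementation. It assumes an update is passed once around a logical ring of replicas and that failed members are simply skipped. The names `Replica`, `replicate`, and the `site-*` and `job-42` values are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One replica of a hypothetical replicated service (illustrative only)."""
    name: str
    alive: bool = True
    store: dict = field(default_factory=dict)


def replicate(ring, origin, key, value):
    """Pass an update once around the ring, starting at ring[origin].

    Replicas marked as failed are skipped. Returns the names of the
    replicas that applied the update, in the order they were visited.
    """
    applied = []
    n = len(ring)
    for step in range(n):
        replica = ring[(origin + step) % n]
        if not replica.alive:
            continue  # assumption: a failed member is skipped, not waited for
        replica.store[key] = value
        applied.append(replica.name)
    return applied


ring = [Replica("site-a"), Replica("site-b"), Replica("site-c")]
ring[1].alive = False  # simulate a failed site
visited = replicate(ring, origin=2, key="job-42", value="checkpoint-7")
print(visited)
```

In this sketch every surviving replica ends up with the same entry, which is the consistency property the abstract alludes to. A real protocol would also need to handle concurrent updates, failures during a round, and recovery of rejoining replicas.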