A hybrid fault tolerance scheme for EasyGrid MPI applications

  • Authors:
  • Jacques A. da Silva; Vinod E. F. Rebello

  • Affiliations:
  • Universidade Federal Fluminense (UFF), Niterói, RJ, Brazil; Universidade Federal Fluminense (UFF), Niterói, RJ, Brazil

  • Venue:
  • Proceedings of the 9th International Workshop on Middleware for Grids, Clouds and e-Science
  • Year:
  • 2011

Abstract

Writing applications capable of executing efficiently in distributed systems is extremely difficult and tedious for inexperienced users. The resources may be heterogeneous, non-dedicated, and offered without any performance or availability guarantees. Systems capable of adapting the execution of an application to these characteristics are therefore essential. The EasyGrid Application Management System (AMS) transforms cluster-based MPI applications into autonomic ones capable of executing robustly and efficiently in distributed environments. This work describes a strategy to endow these autonomic MPI applications with the property of self-healing, so that they can withstand multiple simultaneous crash faults of processes and/or processors. The extremely low intrusion cost of the proposed hybrid solution might now facilitate the acceptance of fault tolerance techniques in large-scale high-performance applications.