Resilience for collaborative applications on clouds: fault-tolerance for distributed HPC applications

  • Authors:
  • Toàn Nguyên; Jean-Antoine Désidéri

  • Affiliations:
  • Project OPALE, INRIA, Saint-Ismier, France; Project OPALE, INRIA, Saint-Ismier, France

  • Venue:
  • ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part IV
  • Year:
  • 2012

Abstract

Because e-Science applications are data-intensive and require long execution runs, it is important that they feature fault-tolerance mechanisms. Cloud and grid computing infrastructures often support system and network fault tolerance: they repair and prevent communication and software errors. They also allow checkpointing of applications and duplication of jobs and data to guard against catastrophic hardware failures. However, only preliminary work has been done so far on application resilience, i.e., the ability to resume normal execution following application errors and abnormal executions. This paper gives an overview of open issues and solutions for the detection and management of such errors. It also outlines the implementation of a workflow management system to design, deploy, execute, monitor, restart and resume distributed HPC applications on cloud infrastructures in case of failures.