Resilience for collaborative applications on clouds: fault-tolerance for distributed HPC applications

Authors:
Toàn Nguyên;Jean-Antoine Désidéri
Affiliations:
Project OPALE, INRIA, Saint-Ismier, France;Project OPALE, INRIA, Saint-Ismier, France
Venue:
ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part IV
Year:
2012

Abstract

Because e-Science applications are data intensive and require long execution runs, it is important that they feature fault-tolerance mechanisms. Cloud and grid computing infrastructures often support system and network fault-tolerance. They repair and prevent communication and software errors. They also allow checkpointing of applications and duplication of jobs and data to guard against catastrophic hardware failures. However, only preliminary work has been done so far on application resilience, i.e., the ability to resume normal execution following application errors and abnormal executions. This paper gives an overview of open issues and solutions for detecting and managing such errors. It also outlines the implementation of a workflow management system to design, deploy, execute, monitor, restart and resume distributed HPC applications on cloud infrastructures in case of failures.
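
As an illustration of the checkpoint, restart and resume idea mentioned in the abstract, the following is a minimal sketch in Python. It is not the workflow management system described in the paper: the function `run_workflow`, the `checkpoint_path` parameter and the JSON state layout are all hypothetical names chosen for this example. It assumes a workflow that is a simple ordered list of named tasks whose results are JSON-serializable.

```python
import json
import os


def run_workflow(tasks, checkpoint_path):
    """Run (name, func) tasks in order, skipping tasks already checkpointed.

    Hypothetical helper for illustration only. Each func receives the dict
    of results produced so far and returns a JSON-serializable value.
    """
    # Resume case: reload the state saved by an earlier, interrupted run.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"done": [], "results": {}}

    for name, func in tasks:
        if name in state["done"]:
            continue  # completed before the failure, do not re-execute

        # If func raises, the exception propagates and the checkpoint on
        # disk still reflects the last successfully completed task.
        state["results"][name] = func(state["results"])
        state["done"].append(name)

        # Write to a temporary file, then rename it over the checkpoint,
        # so an interrupted write does not leave a truncated file behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return state["results"]
```

Calling `run_workflow` a second time with the same `checkpoint_path` after a task has raised an exception re-executes only the failed task and those after it. A real system would also have to deal with distributed state, non-serializable data and the detection of abnormal (but non-crashing) executions, which this sketch does not attempt.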