Because e-Science applications are data intensive and require long execution runs, they need fault-tolerance mechanisms. Cloud and grid computing infrastructures often support system and network fault tolerance: they prevent and repair communication and software errors. They also allow checkpointing of applications and duplication of jobs and data to protect against catastrophic hardware failures. However, only preliminary work has been done so far on application resilience, i.e., the ability to resume normal execution following application errors and abnormal executions. This paper gives an overview of open issues and solutions for detecting and managing such errors. It also gives an overview of the implementation of a workflow management system used to design, deploy, execute, monitor, restart and resume distributed HPC applications on cloud infrastructures in case of failure.