Fault tolerance is a major concern for guaranteeing the availability of critical services as well as application execution. Traditional approaches to fault tolerance include checkpoint/restart and duplication. However, it is also possible to anticipate failures and act proactively before they occur, in order to minimize their impact on the system and on application execution. This document presents a proactive fault tolerance framework. The framework can use different proactive fault tolerance mechanisms, namely migration and pause/unpause. Its modular architecture also allows new proactive fault tolerance policies to be implemented. A first proactive fault tolerance policy has been implemented, and preliminary experiments based on system-level virtualization have been conducted and compared with results obtained by simulation.
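To make the separation between policies and mechanisms concrete, the following is a minimal sketch of how a modular proactive fault tolerance framework could be organized. It is not the framework described above: every name in it (`Action`, `HealthSample`, `threshold_policy`, `Framework`), the use of temperature as a health indicator, and the threshold values are assumptions chosen purely for illustration. The mechanisms are plain callbacks, so no real hypervisor or migration API is invoked.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Action(Enum):
    """Proactive actions a policy may request (hypothetical names)."""
    NONE = "none"
    PAUSE = "pause"
    MIGRATE = "migrate"


@dataclass
class HealthSample:
    """One monitoring reading for a node; temperature is an assumed indicator."""
    node: str
    temperature_c: float


# A policy is any function that maps a health sample to an action.
Policy = Callable[[HealthSample], Action]
# A mechanism is any function that acts on a node name.
Mechanism = Callable[[str], None]


def threshold_policy(warn_c: float = 70.0, critical_c: float = 85.0) -> Policy:
    """Build a simple policy: pause above warn_c, migrate above critical_c.

    The thresholds are illustrative assumptions, not values from the source.
    """
    def decide(sample: HealthSample) -> Action:
        if sample.temperature_c >= critical_c:
            return Action.MIGRATE
        if sample.temperature_c >= warn_c:
            return Action.PAUSE
        return Action.NONE
    return decide


class Framework:
    """Dispatches the action chosen by a pluggable policy to a mechanism."""

    def __init__(self, policy: Policy, mechanisms: Dict[Action, Mechanism]):
        self.policy = policy
        self.mechanisms = mechanisms

    def handle(self, sample: HealthSample) -> Action:
        action = self.policy(sample)
        mechanism = self.mechanisms.get(action)
        if mechanism is not None:  # Action.NONE has no mechanism registered
            mechanism(sample.node)
        return action


if __name__ == "__main__":
    log = []
    framework = Framework(
        threshold_policy(),
        {
            Action.PAUSE: lambda node: log.append(("pause", node)),
            Action.MIGRATE: lambda node: log.append(("migrate", node)),
        },
    )
    for reading in (50.0, 75.0, 90.0):
        framework.handle(HealthSample("node-1", reading))
    print(log)
```

Because the policy is just a function, a new policy can be swapped in without touching the dispatch code or the mechanisms, which is the kind of modularity the abstract refers to. The unpause step is left out of this sketch for brevity.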