Various fault-tolerance (FT) mechanisms are used today to reduce the impact of failures on application execution. For system failures, the standard FT mechanisms are checkpoint/restart (for reactive FT) and migration (for proactive FT). However, each of these mechanisms imposes an overhead on application execution. This overhead becomes critical on large-scale systems, where previous studies have shown that applications may spend more time checkpointing state than performing useful work.