As the scale of high-performance computing (HPC) grows, application fault resilience becomes increasingly important. In this paper, we propose FT-Pro, an adaptive fault management approach that combines the merits of reactive checkpointing and proactive migration. It enables parallel applications to avoid anticipated failures via preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure predictions. We evaluate FT-Pro through stochastic modeling and case studies with real applications under a wide range of settings. Preliminary results indicate that FT-Pro outperforms periodic checkpointing, both reducing application completion times and improving resource utilization, by up to 43%.
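To make the idea of a runtime adaptation decision concrete, the following is a minimal sketch of how a manager might choose among migrating, checkpointing, and doing nothing at a decision point. It is not FT-Pro's actual algorithm: the expected-cost rule, the `Costs` fields, the `choose_action` function, and the assumption that a migration fully avoids the predicted failure are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical per-action overheads, all in seconds."""
    checkpoint: float  # time to write one checkpoint
    migration: float   # time to move a process to a spare node
    recovery: float    # time to restart from the last checkpoint
    interval: float    # work done between two decision points


def choose_action(p_fail: float, work_at_risk: float,
                  costs: Costs, spare_available: bool) -> str:
    """Pick the action with the lowest expected time cost.

    p_fail is the estimated probability of a failure during the next
    interval (for example, predictor precision when a failure is
    predicted, or a small base rate otherwise). work_at_risk is the
    work done since the last checkpoint.
    """
    expected = {
        # No action: a failure loses all uncheckpointed work.
        "skip": p_fail * (costs.recovery + work_at_risk + costs.interval),
        # Checkpoint now: a failure loses only the next interval.
        "checkpoint": costs.checkpoint
                      + p_fail * (costs.recovery + costs.interval),
    }
    if spare_available:
        # Assumption: migration fully avoids the predicted failure.
        expected["migrate"] = costs.migration
    return min(expected, key=expected.get)


costs = Costs(checkpoint=60, migration=90, recovery=120, interval=600)
likely_failure = choose_action(0.8, 1200, costs, spare_available=True)
no_spare = choose_action(0.8, 1200, costs, spare_available=False)
quiet_period = choose_action(0.01, 0, costs, spare_available=True)
```

Under these illustrative numbers, a confident failure prediction favors migration when a spare node exists and falls back to a checkpoint when none does, while a low failure probability with no work at risk favors skipping the checkpoint altogether.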