The productivity of HPC systems is determined not only by their performance but also by their reliability. The conventional method for limiting the impact of failures is checkpointing. However, existing research shows that such a reactive fault tolerance approach can improve system productivity only marginally. Leveraging recent progress in failure prediction, we propose fault-driven rescheduling (FARS) to improve system resilience to failures, and we investigate the feasibility and effectiveness of using failure predictions to dynamically adjust the placement of active (i.e., running) jobs. In particular, we design a rescheduling algorithm that enables effective job adjustment by evaluating the performance impact of both potential failures and rescheduling on user jobs. FARS complements existing research on fault-aware scheduling by allowing user jobs to avoid imminent failures at runtime. We evaluate FARS using actual workloads and failure events collected from production HPC systems. Our preliminary results show the potential of FARS for improving system resilience to failures.
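The trade-off that the rescheduling algorithm weighs, the impact of a predicted failure against the cost of moving a job, can be illustrated with a minimal sketch. This is not the FARS algorithm itself, whose details the abstract does not give. The cost model (expected loss equals failure probability times lost work plus restart cost), the greedy ordering, and every name below (`Job`, `expected_failure_loss`, `plan_rescheduling`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a running job (all times in seconds)."""
    name: str
    node: str                     # node the job currently runs on
    work_since_checkpoint: float  # work lost if the node fails now
    restart_cost: float           # time to restart after a failure
    migration_cost: float         # overhead of moving the job elsewhere


def expected_failure_loss(job: Job, p_fail: float) -> float:
    """Assumed cost model: probability of failure times the time lost."""
    return p_fail * (job.work_since_checkpoint + job.restart_cost)


def plan_rescheduling(jobs, failure_prob, spare_nodes):
    """Greedily assign spare nodes to the jobs that benefit most.

    failure_prob maps a node name to its predicted failure probability;
    nodes missing from the map are treated as healthy.
    Returns a list of (job name, target node) pairs.
    """
    spares = list(spare_nodes)
    candidates = []
    for job in jobs:
        p_fail = failure_prob.get(job.node, 0.0)
        gain = expected_failure_loss(job, p_fail) - job.migration_cost
        if gain > 0:  # move only if staying is expected to cost more
            candidates.append((gain, job))
    # Serve the largest expected gain first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    plan = []
    for _gain, job in candidates:
        if not spares:
            break
        plan.append((job.name, spares.pop(0)))
    return plan


jobs = [
    Job("a", "n1", 3600, 300, 120),
    Job("b", "n2", 600, 300, 120),
    Job("c", "n3", 7200, 300, 120),
]
failure_prob = {"n1": 0.8, "n2": 0.1, "n3": 0.9}
print(plan_rescheduling(jobs, failure_prob, ["s1", "s2"]))
```

In this example job "b" stays in place because its expected loss (90 s) is below its migration cost (120 s), while "c" and "a" are moved in order of expected gain. A realistic scheduler would also have to account for prediction accuracy, multi-node jobs, and the failure risk of the spare nodes themselves, none of which this sketch models.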