Fault-Driven Re-Scheduling For Improving System-level Fault Resilience

Authors:
Yawei Li;Prashasta Gujrati;Zhiling Lan;Xian-he Sun
Affiliations:
Illinois Institute of Technology, USA;Illinois Institute of Technology, USA;Illinois Institute of Technology, USA;Illinois Institute of Technology, USA/ Fermi National Accelerator Laboratory, USA
Venue:
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Year:
2007

Citing 0
Cited 7

Proactive process-level live migration in HPC environments
Proceedings of the 2008 ACM/IEEE conference on Supercomputing

Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities
International Journal of High Performance Computing Applications

Toward Exascale Resilience
International Journal of High Performance Computing Applications

Predicting computer system failures using support vector machines
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs

An MPI-based implementation of intelligent agents on clusters
SpringSim '10 Proceedings of the 2010 Spring Simulation Multiconference

Failure-aware workflow scheduling in cluster environments
Cluster Computing

Proactive process-level live migration and back migration in HPC environments
Journal of Parallel and Distributed Computing

Abstract

The productivity of HPC systems is determined not only by their performance but also by their reliability. The conventional method of limiting the impact of failures is checkpointing. However, existing research shows that such a reactive fault tolerance approach improves system productivity only marginally. Leveraging recent progress in failure prediction, we propose fault-driven rescheduling (FARS) to improve system resilience to failures, and we investigate the feasibility and effectiveness of using failure predictions to dynamically adjust the placement of active jobs (e.g., running jobs). In particular, we design a rescheduling algorithm that enables effective job adjustment by evaluating the performance impact of both potential failures and rescheduling on user jobs. FARS complements existing research on fault-aware scheduling by allowing user jobs to avoid imminent failures at runtime. We evaluate FARS using actual workloads and failure events collected from production HPC systems. Our preliminary results show the potential of FARS for improving system resilience to failures.
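
The abstract does not spell out the rescheduling algorithm, so the following is only a minimal sketch of the trade-off it describes: weighing the expected impact of a predicted failure against the cost of moving a job. Every name here (`Job`, `expected_failure_loss`, `rescheduling_loss`, `should_reschedule`) and the simple node-seconds cost model are assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a running job (not from the paper)."""
    name: str
    nodes: int                    # number of nodes the job occupies
    work_since_checkpoint: float  # seconds of work lost if a failure hits now
    migration_cost: float         # assumed overhead, in seconds, of moving the job


def expected_failure_loss(job: Job, failure_prob: float, restart_cost: float) -> float:
    """Expected node-seconds lost if the job stays put.

    Assumes a failure costs the work done since the last checkpoint plus a
    fixed restart overhead, weighted by the predicted failure probability.
    """
    return failure_prob * (job.work_since_checkpoint + restart_cost) * job.nodes


def rescheduling_loss(job: Job) -> float:
    """Node-seconds spent moving the job away from the suspect node."""
    return job.migration_cost * job.nodes


def should_reschedule(job: Job, failure_prob: float,
                      restart_cost: float, spare_nodes: int) -> bool:
    """Reschedule only if a spare node exists and moving is the cheaper option."""
    if spare_nodes < 1:
        return False  # nowhere to move the affected process
    return expected_failure_loss(job, failure_prob, restart_cost) > rescheduling_loss(job)


if __name__ == "__main__":
    job = Job(name="sim-42", nodes=4, work_since_checkpoint=3600.0, migration_cost=120.0)
    # A confident prediction makes the move worthwhile; a weak one does not.
    print(should_reschedule(job, failure_prob=0.7, restart_cost=300.0, spare_nodes=2))
    print(should_reschedule(job, failure_prob=0.01, restart_cost=300.0, spare_nodes=2))
```

Under this toy model, the decision depends on the predictor's confidence: a low-probability warning is ignored because the migration overhead would exceed the expected loss, which reflects the abstract's point that rescheduling itself has a performance impact on user jobs.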