Fault-Aware Runtime Strategies for High-Performance Computing

Authors:
Yawei Li;Zhiling Lan;Prashasta Gujrati;Xian-He Sun
Affiliations:
Illinois Institute of Technology, Chicago;Illinois Institute of Technology, Chicago;Illinois Institute of Technology, Chicago;Illinois Institute of Technology, Chicago
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2009

Abstract

As the scale of parallel systems continues to grow, fault management of these systems is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. These strategies, together with a failure predictor and fault tolerance techniques, form a runtime system called FARS (Fault-Aware Runtime System). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments using synthetic data and real traces from production systems show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
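
The abstract does not spell out the knapsack formulation, so the following is only a minimal sketch of how a generic 0-1 knapsack could be applied to job rescheduling, not the FARS model itself. It assumes, hypothetically, that each job predicted to be hit by a failure needs a known number of spare nodes to be reallocated (its weight) and carries an estimated benefit from avoiding the failure (its value), and that the number of available spare nodes is the capacity. The function name `select_jobs_to_reschedule` and all parameters are illustrative.

```python
from typing import List, Tuple


def select_jobs_to_reschedule(
    spare_nodes_needed: List[int],   # assumed weight: spare nodes each job requires
    benefit: List[float],            # assumed value: estimated gain from moving the job
    spare_nodes_available: int,      # assumed capacity: size of the spare-node pool
) -> Tuple[float, List[int]]:
    """Standard 0-1 knapsack via dynamic programming.

    Returns the best total benefit and the indices of the jobs chosen
    for reallocation. This is a textbook formulation, not the paper's.
    """
    n = len(spare_nodes_needed)
    cap = spare_nodes_available
    # best[i][c] = maximum benefit using the first i jobs with c spare nodes
    best = [[0.0] * (cap + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = spare_nodes_needed[i - 1], benefit[i - 1]
        for c in range(cap + 1):
            best[i][c] = best[i - 1][c]          # option 1: leave job i-1 in place
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # option 2: reallocate it

    # Walk the table backwards to recover which jobs were selected.
    chosen: List[int] = []
    c = cap
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= spare_nodes_needed[i - 1]
    chosen.reverse()
    return best[n][cap], chosen


if __name__ == "__main__":
    total, jobs = select_jobs_to_reschedule([2, 3, 4], [3.0, 4.0, 6.0], 5)
    print(total, jobs)
```

With five spare nodes and three candidate jobs needing 2, 3, and 4 nodes, the sketch picks the first two jobs, since together they fit the pool and yield a higher combined benefit than the third job alone. The dynamic program runs in time proportional to the number of jobs times the number of spare nodes, which is one reason a knapsack formulation may be practical at runtime.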