Optimizing HPC Fault-Tolerant Environment: An Analytical Approach

Authors:
Hui Jin;Yong Chen;Huaiyu Zhu;Xian-He Sun
Venue:
ICPP '10 Proceedings of the 2010 39th International Conference on Parallel Processing
Year:
2010

Cited by (3):

Towards scalable I/O architecture for exascale systems
Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers

Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)

Performance comparison under failures of MPI and MapReduce: An analytical approach
Future Generation Computer Systems

Abstract

The growing ensemble size of modern High-Performance Computing (HPC) systems has drastically increased the likelihood of failures. Performance under failures, and its optimization, has become a timely and important issue for the HPC community. In this study, we propose an analytical model to predict application performance. The model characterizes the impact of coordinated checkpointing and system failures on application performance, taking into account workload, the number of nodes, failure arrival rate, recovery cost, and checkpointing interval and overhead. Based on the model, we examine three parameters (the number of compute nodes, the checkpointing interval, and the number of spare nodes) to conduct a comprehensive study of performance optimization under failures. We also study performance scalability under failures to explore how much room for improvement each parameter offers. Experimental results from both synthetic and actual system failure logs confirm that the proposed model and optimization methodologies are effective and feasible.
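
The abstract does not reproduce the paper's model, so the sketch below is not the authors' formulation. As a rough illustration of how checkpointing interval, checkpoint overhead, failure rate, and node count interact, it uses the classical first-order approximation attributed to Young, under the assumption of independent, exponentially distributed node failures. All function and parameter names are hypothetical.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """System-wide mean time between failures.

    Assumes independent, exponentially distributed node failures, so the
    failure rates add up and the system MTBF shrinks linearly with node count.
    """
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


def waste_fraction(interval_hours: float, checkpoint_cost_hours: float,
                   mtbf_hours: float) -> float:
    """First-order fraction of time lost to checkpointing plus rework.

    The first term is checkpoint overhead per interval; the second is the
    expected lost work (half an interval on average) per failure.
    Recovery cost is deliberately ignored in this simplified sketch.
    """
    return (checkpoint_cost_hours / interval_hours
            + interval_hours / (2.0 * mtbf_hours))


if __name__ == "__main__":
    mtbf = system_mtbf(node_mtbf_hours=5000.0, num_nodes=1000)
    tau = young_interval(checkpoint_cost_hours=0.1, mtbf_hours=mtbf)
    print(f"system MTBF: {mtbf:.2f} h, checkpoint interval: {tau:.2f} h")
    print(f"approximate waste: {waste_fraction(tau, 0.1, mtbf):.2%}")
```

Under these assumptions, adding nodes shortens the system MTBF, which in turn shortens the preferred checkpoint interval and raises the fraction of time lost. That is the kind of trade-off among node count and checkpointing interval that the abstract describes, although the paper's own model also covers recovery cost and spare nodes.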