Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment

Authors:
Hui Jin;Tao Ke;Yong Chen;Xian-He Sun
Affiliations:
-;-;-;-
Venue:
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Year:
2012

Citing 12
Cited 0

A survey of rollback-recovery protocols in message-passing systems

ACM Computing Surveys (CSUR)
Data Sieving and Collective I/O in ROMIO

FRONTIERS '99 Proceedings of the The 7th Symposium on the Frontiers of Massively Parallel Computation
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand

ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Modeling the Impact of Checkpoints on Next-Generation Systems

MSST '07 Proceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies
Interconnect agnostic checkpoint/restart in open MPI

Proceedings of the 18th ACM international symposium on High performance distributed computing
Toward Exascale Resilience

International Journal of High Performance Computing Applications
PLFS: a checkpoint filesystem for parallel applications

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems

ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Optimizing HPC Fault-Tolerant Environment: An Analytical Approach

ICPP '10 Proceedings of the 2010 39th International Conference on Parallel Processing
GPFS: a shared-disk file system for large computing clusters

FAST'02 Proceedings of the 1st USENIX conference on File and storage technologies

Quantified Score

Hi-index	0.00

Visualization

Abstract

Check pointing is widely used in technical computing. However, the overhead of check pointing is a subject of increasing in concern in recent years, especially for large-scale parallel computer systems. In these systems, check pointing generates a huge number of concurrent I/O writes. The burst of writes plus the worsening I/O-wall problem often leads to network and I/O congestion, and makes the overall system performance painfully slow. Recognizing contention as a dominant performance factor, in this paper we propose a systematic approach named check pointing orchestration to reduce write contention, which combines the marshaling of concurrent checkpoint requests and the adopting of vertical data access in coordination. A prototype of the proposed check pointing orchestration approach has been implemented at the system-level under Open MPI over the PVFS2 file system. Extensive experiments based on NPB benchmarks have been conducted to verify the design and implementation. Experimental results show that check pointing orchestration reduced the check pointing cost at a degree of more than 30%. Check pointing cost was halved for 4 out of 5 the C class NPB benchmarks.