Can checkpoint/restart mechanisms benefit from hierarchical data staging?

  • Authors:
  • Raghunath Rajachandrasekar; Xiangyong Ouyang; Xavier Besseron; Vilobh Meshram; Dhabaleswar K. Panda

  • Affiliation (all authors):
  • Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

  • Venue:
  • Euro-Par '11: Proceedings of the 2011 International Conference on Parallel Processing, Volume 2
  • Year:
  • 2011

Abstract

Given the ever-increasing size of supercomputers, fault resilience and the ability to tolerate faults have become more of a necessity than an option. Checkpoint-Restart protocols have been widely adopted as a practical solution to provide reliability. However, traditional checkpointing mechanisms suffer from a heavy I/O bottleneck while dumping process snapshots to a shared filesystem. In this context, we study the benefits of data staging, using a proposed hierarchical and modular data staging framework that reduces the burden of checkpointing on client nodes without penalizing them in terms of performance. During a checkpointing operation in this framework, the compute nodes transmit their process snapshots to a set of dedicated staging I/O servers through a high-throughput RDMA-based data pipeline. Unlike conventional checkpointing mechanisms that block an application until the checkpoint data has been written to a shared filesystem, we allow the application to resume its execution immediately after the snapshots have been pipelined to the staging I/O servers, while data is simultaneously being moved from these servers to a backend shared filesystem. This framework eases the bottleneck caused by simultaneous writes from multiple clients to the underlying storage subsystem. The staging framework considered in this study reduces the time penalty an application pays to save a checkpoint by a factor of 8.3.
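
To make the core idea concrete, the following is a minimal, self-contained C sketch of the staging pattern only: an "application" thread hands its snapshot to an in-memory staging area and resumes immediately, while a background thread plays the role of the staging I/O server and drains the data to a backend file. All names here (stage_checkpoint, drain_to_backend, BACKEND_PATH) are hypothetical illustrations, not the framework's API; the actual system pipelines the snapshot over RDMA to dedicated staging servers on separate nodes rather than to a thread in the same process.

```c
/* Sketch of staged checkpointing: the checkpoint call returns as soon as the
 * snapshot is staged; the backend write proceeds asynchronously.
 * (Hypothetical example, not the paper's implementation.) */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BACKEND_PATH "checkpoint.img"   /* stand-in for the shared filesystem */

typedef struct {
    char  *data;        /* staged snapshot (a chunked RDMA pipeline in reality) */
    size_t len;
    int    ready;       /* set once the snapshot is fully staged */
    pthread_mutex_t lock;
    pthread_cond_t  cv;
} staging_buf_t;

static staging_buf_t stage = { NULL, 0, 0, PTHREAD_MUTEX_INITIALIZER,
                               PTHREAD_COND_INITIALIZER };

/* Application side: copy the snapshot into the staging area and return.
 * The caller does NOT wait for the backend write to finish. */
static void stage_checkpoint(const void *snapshot, size_t len)
{
    pthread_mutex_lock(&stage.lock);
    stage.data = malloc(len);
    memcpy(stage.data, snapshot, len);
    stage.len   = len;
    stage.ready = 1;
    pthread_cond_signal(&stage.cv);
    pthread_mutex_unlock(&stage.lock);
}

/* Staging-server side: wait until a snapshot has been staged, then write it
 * to the backend store while the application keeps computing. */
static void *drain_to_backend(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&stage.lock);
    while (!stage.ready)
        pthread_cond_wait(&stage.cv, &stage.lock);
    pthread_mutex_unlock(&stage.lock);

    FILE *fp = fopen(BACKEND_PATH, "wb");
    if (fp) {
        fwrite(stage.data, 1, stage.len, fp);
        fclose(fp);
    }
    free(stage.data);
    return NULL;
}

int main(void)
{
    pthread_t server;
    pthread_create(&server, NULL, drain_to_backend, NULL);

    char snapshot[] = "process state ...";        /* placeholder process image */
    stage_checkpoint(snapshot, sizeof snapshot);  /* returns immediately */

    /* ... application resumes useful work here while the drain proceeds ... */

    pthread_join(server, NULL);                   /* only for a clean exit */
    return 0;
}
```

The key design point this sketch mirrors is the decoupling of checkpoint completion from backend I/O: the application's blocking time is bounded by how fast data can be staged (memory copy here, RDMA transfer in the framework), not by the throughput of the shared filesystem.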