Today, CUDA is the de facto standard programming framework for exploiting the computational power of graphics processing units (GPUs) to accelerate a wide range of applications. For efficient use of a large GPU-accelerated system, one important mechanism is checkpoint-restart, which can be used not only to improve fault tolerance but also to optimize node/slot allocation by suspending a job on one node and migrating it to another. Although several checkpoint-restart implementations have been developed, they either do not support CUDA applications or have severe limitations in their CUDA support. Hence, we present a checkpoint-restart library for CUDA that first destroys all CUDA resources before checkpointing and then restores them immediately afterward. Each memory chunk must be restored at the same memory address; to this end, we propose a novel technique that replays memory-related API calls. The library supports both the CUDA runtime API and the CUDA driver API. Moreover, the library is transparent to applications: recompilation is not necessary for checkpointing. This paper demonstrates that the proposed library achieves checkpoint-restart of various applications at acceptable overheads and that it also works for MPI applications such as HPL.