CprFS: a user-level file system to support consistent file states for checkpoint and restart

  • Authors:
  • Ruini Xue; Wenguang Chen; Weimin Zheng

  • Affiliations:
  • High Performance Computing Institution, Beijing, China; High Performance Computing Institution, Beijing, China; High Performance Computing Institution, Beijing, China

  • Venue:
  • Proceedings of the 22nd annual international conference on Supercomputing
  • Year:
  • 2008

Abstract

Checkpoint and Restart (CPR) is becoming critical to large-scale parallel computers, whose Mean Time Between Failures (MTBF) may be much shorter than the execution times of their applications. A CPR mechanism should store and recover an application's virtual memory, communication, and file states consistently. However, many CPR tools ignore file states, which can cause errors on recovery for applications that perform file operations. Some CPR tools handle file states with library-based approaches or kernel-level file systems, but these support only a limited set of file operations, which is insufficient for some applications. Moreover, many library-based approaches are not transparent to user applications because they wrap file APIs, and kernel-level file systems are difficult to deploy in production systems because they may impose unnecessary overhead on applications that do not need CPR. In this paper we propose a user-level file system, CprFS, to address these problems. As a file system, CprFS is transparent to user applications and can readily support arbitrary file operations. It can be deployed on demand for individual applications, avoiding interference with other applications. Experimental results on micro-benchmarks and real-world applications show that CprFS introduces acceptable overhead and has little impact on checkpointing systems.
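
To make the file-state problem concrete, the following is a minimal sketch of how a restart can corrupt output when a checkpoint captures only in-memory state. It is not CprFS's mechanism, which the abstract does not detail; the function names, the dictionary checkpoint format, and the truncate-on-restart rollback are all assumptions made for illustration, and the rollback shown handles only append-style writes.

```python
import os
import tempfile


def run_steps(path, start, stop):
    """Append one line per step, standing in for an application's file output."""
    with open(path, "a") as f:
        for step in range(start, stop):
            f.write(f"step {step}\n")


def checkpoint(path, next_step, track_file_state):
    """Record in-memory state and, optionally, the file length (hypothetical format)."""
    state = {"next_step": next_step}
    if track_file_state:
        state["file_size"] = os.path.getsize(path)
    return state


def restart(path, state):
    """Roll the file back to its checkpointed length, if one was recorded."""
    if "file_size" in state:
        os.truncate(path, state["file_size"])
    return state["next_step"]


def simulate(track_file_state):
    """Run, checkpoint, 'fail' after two more steps, restart, and finish."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "out.log")
        run_steps(path, 0, 3)                          # steps 0..2
        state = checkpoint(path, 3, track_file_state)  # checkpoint before step 3
        run_steps(path, 3, 5)                          # steps 3..4, then a failure
        resume_at = restart(path, state)               # recover from the checkpoint
        run_steps(path, resume_at, 6)                  # re-execute steps 3..5
        with open(path) as f:
            return f.read().splitlines()


if __name__ == "__main__":
    print(simulate(track_file_state=False))  # steps 3 and 4 appear twice
    print(simulate(track_file_state=True))   # each step appears exactly once
```

When the file length is not part of the checkpoint, the steps executed between the checkpoint and the failure are written a second time after recovery. Operations such as overwrites, renames, or deletions would need more than a recorded length to undo, which is the kind of gap the abstract attributes to approaches that support only limited file operations.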