A 1 PB/s file system to checkpoint three million MPI tasks

  • Authors:
  • Raghunath Rajachandrasekar;Adam Moody;Kathryn Mohror;Dhabaleswar K. (DK) Panda

  • Affiliations:
  • The Ohio State University, Columbus, OH, USA;Lawrence Livermore National Laboratory, Livermore, CA, USA;Lawrence Livermore National Laboratory, Livermore, CA, USA;The Ohio State University, Columbus, OH, USA

  • Venue:
  • Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

With the massive scale of high-performance computing systems, long-running scientific parallel applications periodically save the state of their execution to files called checkpoints to recover from system failures. Checkpoints are stored on external parallel file systems, but limited bandwidth makes this a time-consuming operation. Multilevel checkpointing systems, like the Scalable Checkpoint/Restart (SCR) library, alleviate this bottleneck by caching checkpoints in storage located close to the compute nodes. However, most large scale systems do not provide file storage on compute nodes, preventing the use of SCR. We have implemented a novel user-space file system that stores data in main memory and transparently spills over to other storage, like local flash memory or the parallel file system, as needed. This technique extends the reach of libraries like SCR to systems where they otherwise could not be used. Furthermore, we expose file contents for Remote Direct Memory Access, allowing external tools to copy checkpoints to the parallel file system in the background with reduced CPU interruption. Our file system scales linearly with node count and delivers a 1~PB/s throughput at three million MPI processes, which is 20x faster than the system RAM disk and 1000x faster than the parallel file system.