Enhancing Checkpoint Performance with Staging IO and SSD

  • Authors:
  • Xiangyong Ouyang; Sonya Marcarelli; Dhabaleswar K. Panda

  • Venue:
  • SNAPI '10 Proceedings of the 2010 International Workshop on Storage Network Architecture and Parallel I/Os
  • Year:
  • 2010

Abstract

With the ever-growing size of computer clusters and applications, system failures are becoming inevitable. Checkpointing, a strategy to ensure fault tolerance, has become imperative in such an environment. However, the existing mechanism for writing checkpoints to parallel storage systems does not perform well as job size increases. Solid State Disks (SSDs) are attracting more and more attention due to their technical merits, such as good random access performance, low power consumption, and shock resistance. However, how to apply SSDs in a parallel storage system to improve checkpoint writing remains an open question. In this paper we propose a new strategy to enhance checkpoint writing performance by aggregating checkpoint writes at the client side and utilizing staging IO on the data servers. We also explore the potential of substituting SSDs for traditional hard disks on the data servers to achieve better write bandwidth. Our strategy achieves up to 6.3 times higher write bandwidth than PVFS2, a popular parallel file system, with 8 client nodes and 4 data servers. In experiments with real applications using 64 application processes and 4 data servers, our strategy accelerates checkpoint writing by up to 9.9 times compared to PVFS2.
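
The abstract does not describe how client-side aggregation is implemented. As a rough illustration of the general idea only, the Python sketch below coalesces many small checkpoint writes into a smaller number of large sequential writes before they reach storage. The class `WriteAggregator`, the helper `demo`, the chunk size, and the in-memory sink are all hypothetical names and values chosen for this example; none of them come from the paper, and the sketch does not model staging IO or SSDs.

```python
import io


class WriteAggregator:
    """Coalesce small writes into fewer large ones (hypothetical helper)."""

    def __init__(self, sink, chunk_size=4 * 1024 * 1024):
        self.sink = sink              # any object with a write(bytes) method
        self.chunk_size = chunk_size  # assumed flush threshold, not from the paper
        self.buffer = bytearray()
        self.flushes = 0              # number of writes actually issued to the sink

    def write(self, data):
        self.buffer.extend(data)
        # Emit full chunks as soon as enough data has accumulated.
        while len(self.buffer) >= self.chunk_size:
            self._emit(self.chunk_size)

    def close(self):
        # Flush whatever remains, even if it is smaller than one chunk.
        if self.buffer:
            self._emit(len(self.buffer))

    def _emit(self, n):
        self.sink.write(bytes(self.buffer[:n]))
        del self.buffer[:n]
        self.flushes += 1


def demo():
    """Simulate 64 processes, each issuing 100 small 16-byte writes."""
    sink = io.BytesIO()  # stands in for a file on a data server
    agg = WriteAggregator(sink, chunk_size=1024)
    for rank in range(64):
        for _ in range(100):
            agg.write(bytes([rank]) * 16)
    agg.close()
    return len(sink.getvalue()), agg.flushes
```

In this toy setup the 6,400 small writes reach the sink as 1,024-byte chunks in their original order. A real system would also have to handle concurrent writers, failures, and durability, which this sketch leaves out.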