Enhancing Checkpoint Performance with Staging IO and SSD

Authors:
Xiangyong Ouyang;Sonya Marcarelli;Dhabaleswar K. Panda
Venue:
SNAPI '10 Proceedings of the 2010 International Workshop on Storage Network Architecture and Parallel I/Os
Year:
2010

Abstract

With the ever-growing size of computer clusters and applications, system failures are becoming inevitable. Checkpointing, a strategy to ensure fault tolerance, has become imperative in such an environment. However, the existing mechanism of writing checkpoints to parallel file systems does not perform well as job size increases. Solid State Disks (SSDs) are attracting more and more attention due to their technical merits, such as good random access performance, low power consumption and shock resistance. However, how to apply SSDs in a parallel storage system to improve checkpoint writing remains an open question. In this paper we propose a new strategy to enhance checkpoint writing performance by aggregating checkpoint writes on the client side and utilizing staging IO on the data servers. We also explore the potential of substituting traditional hard disks with SSDs on the data servers to achieve better write bandwidth. Our strategy achieves up to 6.3 times higher write bandwidth than PVFS2, a popular parallel file system, with 8 client nodes and 4 data servers. In experiments with real applications using 64 application processes and 4 data servers, our strategy accelerates checkpoint writing by up to 9.9 times compared to PVFS2.
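The client-side aggregation idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `WriteAggregator`, the `chunk_size` threshold, and the in-memory sink are all hypothetical, and the staging IO done on the data servers is not modeled. The sketch only shows how many small checkpoint writes can be coalesced into fewer, larger writes before they leave the client.

```python
import io


class WriteAggregator:
    """Hypothetical helper: buffers small writes and emits large ones."""

    def __init__(self, sink, chunk_size=4 * 1024 * 1024):
        self.sink = sink              # any object with a write(bytes) method
        self.chunk_size = chunk_size  # assumed flush threshold, in bytes
        self.buffer = bytearray()
        self.flush_count = 0          # number of large writes sent to the sink

    def write(self, data):
        """Accept a small write; forward only full chunks to the sink."""
        self.buffer += data
        while len(self.buffer) >= self.chunk_size:
            self._emit(self.chunk_size)

    def close(self):
        """Send whatever remains in the buffer as a final write."""
        if self.buffer:
            self._emit(len(self.buffer))

    def _emit(self, n):
        # Write the first n buffered bytes, then drop them from the buffer.
        self.sink.write(bytes(self.buffer[:n]))
        del self.buffer[:n]
        self.flush_count += 1


def demo():
    """Simulate 8 processes, each issuing 100 writes of 64 bytes."""
    sink = io.BytesIO()  # stands in for a data server connection
    agg = WriteAggregator(sink, chunk_size=1024)
    for rank in range(8):
        for _ in range(100):
            agg.write(bytes([rank]) * 64)
    agg.close()
    return sink.getvalue(), agg.flush_count
```

In `demo()`, the 800 writes of 64 bytes reach the sink as 50 writes of 1024 bytes. A real client would replace the `io.BytesIO` sink with a network or file system write path.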