Parallel File System Testing for the Lunatic Fringe: The Care and Feeding of Restless I/O Power Users

Authors:
Richard Hedges;Bill Loewe;Tyce McLarty;Chris Morrone
Affiliations:
Lawrence Livermore National Laboratory;Lawrence Livermore National Laboratory;Lawrence Livermore National Laboratory;Lawrence Livermore National Laboratory
Venue:
MSST '05 Proceedings of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies
Year:
2005

Abstract

Over the last several years there has been a major thrust at Lawrence Livermore National Laboratory toward building extremely large-scale computing clusters based on open source software and commodity hardware. On the storage front, our efforts have focused on developing the Lustre file system and bringing it into production in our computer center. Given our customers' requirements, we will inevitably be living on the bleeding edge with this file system software as we press it into production. A further reality is that our partners cannot duplicate systems at the scale required for this testing. For these practical reasons, the onus for file system testing at scale has fallen largely on us. As an integral part of our testing efforts, we have developed programs for stress and performance testing of parallel file systems. This paper focuses on these test programs and on how we apply them to understand the usage and failure modes of such large-scale parallel file systems.