Scalable locking and recovery for network file systems

Authors:
Peter J. Braam
Affiliations:
Sun Microsystems, Inc., Broomfield, CO
Venue:
PDSW '07 Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing '07
Year:
2007

Cited by (2):
Using server-to-server communication in parallel file systems to simplify consistency and improve performance
Proceedings of the 2008 ACM/IEEE conference on Supercomputing

Scale and concurrency of GIGA+: file system directories with millions of files
FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies

Abstract

Petascale computing systems pose serious scalability challenges for any data storage system. Lustre is a scalable, secure, robust, highly available cluster file system that has been successfully deployed on some of the largest supercomputing systems in the world, including the BlueGene/L supercomputer at the Lawrence Livermore National Laboratory (LLNL), the Red Storm supercluster at Sandia National Laboratories, and the Jaguar supercomputer at the Oak Ridge National Laboratory. This paper provides file system developers with insight into how network file system scalability is addressed in the Lustre file system through policies and algorithms that support distributed lock management and options for facilitating recovery after a compute node failure in a large-scale cluster. These design approaches can be applied to the scaling of other file systems to support large clusters.