Scalable locking and recovery for network file systems

  • Authors: Peter J. Braam

  • Affiliations: Sun Microsystems, Inc., Broomfield, CO

  • Venue: PDSW '07 Proceedings of the 2nd international workshop on Petascale data storage: held in conjunction with Supercomputing '07
  • Year: 2007

Abstract

Petascale computing systems pose serious scalability challenges for any data storage system. Lustre is a scalable, secure, robust, highly available cluster file system that has been successfully deployed on some of the largest supercomputing systems in the world, including the BlueGene/L supercomputer at the Lawrence Livermore National Laboratory (LLNL), the Red Storm supercluster at Sandia National Laboratories, and the Jaguar supercomputer at the Oak Ridge National Laboratory. This paper gives file system developers insight into how the Lustre file system addresses network file system scalability through policies and algorithms that support distributed lock management and options for facilitating recovery after a compute node failure in a large-scale cluster. These design approaches can be applied to scaling other file systems to support large clusters.