Petascale I/O: challenges, solutions, and recommendations

  • Authors:
  • Lonnie D. Crosby; Rick Mohr

  • Affiliations:
  • University of Tennessee, Knoxville, Oak Ridge, TN; University of Tennessee, Knoxville, Oak Ridge, TN

  • Venue:
  • Proceedings of the Extreme Scaling Workshop
  • Year:
  • 2012

Abstract

Computing platforms over the past decade have quickly traversed the teraflop regime and have entered the petaflop era by exploiting higher processor clock frequencies, increasingly multi-core processors, and various accelerator technologies. However, file system resources have not kept pace with these advances in computational power. As systems have grown in size and capability, file system resources have become increasingly under-provisioned in terms of capacity, bandwidth, and parallelism. The National Institute for Computational Sciences (NICS) has experienced this situation firsthand. Over the past three years, Kraken has undergone two upgrades that increased its computational capabilities, yet the file system configuration remained unchanged from its initial deployment. The computational capability of Kraken, in terms of peak flop/s, increased by 93.2% between 2009 and 2011. This growth, without a matching file system upgrade, has presented significant challenges for both scientific applications and system administration. This paper discusses the I/O issues encountered by scientific applications, which largely revolve around bandwidth and parallelism. As the size of Kraken has increased, more and more applications are competing for the fixed file system bandwidth. This increases the possibility of saturating the file system and, in turn, decreasing individual application performance. We present some examples encountered by NICS computational scientists while working with user applications, as well as the approaches used to mitigate such issues. In addition, we discuss the importance of file system capacity for large-scale computational resources. As capacity decreases relative to the computational resource's ability to produce data, policies such as file system purges become necessary to maintain the usability of connected resources. This data curation issue feeds requirements such as longer-term archival storage and the ability to move large amounts of data between resources. The institution of globally shared file systems is a further manifestation of the data mobility requirement. Finally, we discuss our experiences with monitoring application behavior and tuning file system parameters, and conclude with recommendations for future file system deployments and directions for scientific applications.