Best practices for the deployment and management of production HPC clusters

  • Authors:
  • Robert McLay; Karl W. Schulz; William L. Barth; Tommy Minyard

  • Affiliations:
  • The University of Texas at Austin; The University of Texas at Austin; The University of Texas at Austin; The University of Texas at Austin

  • Venue:
  • State of the Practice Reports
  • Year:
  • 2011

Abstract

Commodity-based Linux HPC clusters dominate the scientific computing landscape in both academia and industry, ranging from small research clusters to petascale supercomputers supporting thousands of users. To support broad user communities and manage a user-friendly environment, end-user sites must combine a range of low-level system software with multiple compiler chains, support libraries, and a suite of third-party applications. In addition, large systems require bare-metal provisioning and a flexible software management strategy to maintain consistency and upgradeability across thousands of compute nodes. This report documents a Linux operating system framework (LosF), which has evolved over the last seven years to provide an integrated strategy for the deployment of multiple HPC systems at the Texas Advanced Computing Center. Documented within this effort are the high-level cluster configuration options and definitions, bare-metal provisioning, hierarchical HPC software stack design, package management, user environment management tools, user account synchronization, and local customization configurations.