SPOTlight on testing: stability, performance and operational testing of LANL HPC clusters

  • Authors: Georgia Pedicini; Jennifer Green
  • Affiliations: Los Alamos National Laboratory, Los Alamos, NM (both authors)
  • Venue: State of the Practice Reports
  • Year: 2011

Abstract

Testing is sometimes a forgotten component of system management, but it becomes very important in the realm of High Performance Computing (HPC) clusters. Many large-scale HPC cluster installations are one of a kind, with unknown issues and unexpected behaviors. First, the initial installation may uncover complex configuration interactions that are only apparent at scale, so Stability becomes a critical feature of early system testing. Second, Performance may be significantly impacted by small changes to the system. Third, after initial shakeout, users expect a system that is reliable on their terms; ongoing Operational tests verify reliability and provide early warning of developing problems. A robust test suite should address all of these test categories and present both tests and results in a manner that meets usability requirements. We describe Los Alamos National Laboratory's current test suite and the development project to expand the suite to cover these areas and provide better tools for analysis and reporting.