Sustained systems performance monitoring at the U. S. Department of Defense high performance computing modernization program

  • Authors: Paul M. Bennett
  • Affiliations: U.S. DoD High Performance Computing Modernization Program, Vicksburg, MS
  • Venue: State of the Practice Reports
  • Year: 2011

Abstract

The U.S. Department of Defense High Performance Computing Modernization Program (HPCMP) has implemented sustained systems performance (SSP) testing on the high performance computing systems in use at DoD Supercomputing Resource Centers. The intent is to monitor performance improvements delivered by updates to the operating system, compiler suites, and numerical and communications libraries, and to monitor performance penalties arising from security patches. In practice, each system's workload is simulated by a selection of user application codes representative of the HPCMP computational technical areas. Past successes include detecting an imminent failure of an object storage target (OST) in a Cray XT3, an incomplete configuration of a scheduler update on an SGI Altix 4700, performance issues associated with a communications library update for a Linux Networx Advanced Technology Cluster, and intermittent resetting of Intel Nehalem cores from turbo mode to standard mode. This history demonstrates that SSP testing is critical to delivering the highest quality of service to HPCMP users.
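
The monitoring idea described in the abstract, comparing application run times before and after a system change, can be illustrated with a minimal sketch. It is not the HPCMP's actual tooling: the function name `flag_regressions`, the application labels, the use of median wall-clock time, and the 5% tolerance are all assumptions made for illustration.

```python
from statistics import median


def flag_regressions(baseline, current, tolerance=0.05):
    """Return {app: relative_change} for apps that slowed by more than `tolerance`.

    `baseline` and `current` map a (hypothetical) application name to a list of
    wall-clock run times in seconds, gathered before and after a system update.
    The 5% default tolerance is an illustrative assumption, not an HPCMP value.
    """
    flagged = {}
    for app, base_times in baseline.items():
        new_times = current.get(app)
        if not base_times or not new_times:
            continue  # no data to compare for this application
        base = median(base_times)
        new = median(new_times)
        change = (new - base) / base  # positive means slower after the update
        if change > tolerance:
            flagged[app] = change
    return flagged


if __name__ == "__main__":
    # Hypothetical timings for two representative application codes.
    before = {"app_a": [100.0, 102.0, 98.0], "app_b": [50.0, 51.0, 49.0]}
    after = {"app_a": [101.0, 99.0, 100.0], "app_b": [60.0, 61.0, 59.0]}
    for app, change in flag_regressions(before, after).items():
        print(f"{app}: {change:+.1%} change in median run time")
```

The median is used here because it is less sensitive than the mean to a single anomalous run; a production test suite would likely also track run-to-run variability, since intermittent problems such as the turbo-mode resets mentioned above show up as scatter rather than as a steady shift.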