Performance analysis of Intel multiprocessors using astrophysics simulations

  • Authors:
  • Tyler A. Simon; William A. Ward, Jr.; Alan P. Boss

  • Affiliations:
  • NASA Center for Climate Simulation, Goddard Space Flight Center, Greenbelt, MD, USA; DoD High Performance Computing Modernization Program, Lorton, VA, USA; Department of Terrestrial Magnetism, Carnegie Institution of Washington, Washington, DC, USA

  • Venue:
  • Concurrency and Computation: Practice & Experience
  • Year:
  • 2012

Abstract

This paper provides a performance evaluation and investigation of the astrophysics code FLASH on a variety of Intel multiprocessors. The work was performed at the NASA Center for Computational Sciences (NCCS) on behalf of the Carnegie Institution of Washington (CIW), both as a study preliminary to the acquisition of a high-performance computing (HPC) system at the CIW and to allow the NCCS to measure the relative performance of a recently acquired Intel Nehalem-based system against previously installed multicore HPC resources. A brief overview of computer performance evaluation is provided, followed by a description of the systems under test, a description of the FLASH test problem, and the test results. Additionally, the paper characterizes some of the effects of load imbalance imposed by adaptive mesh refinement. Copyright © 2012 John Wiley & Sons, Ltd.

(Note: The same number of MPI processes can be distributed across nodes in various ways. For instance, a 128-process job may be run on a system with four cores per node by allocating 32 nodes with four processes per node (fully populated nodes), 64 nodes with two processes per node (two idle cores per node), or 128 nodes with one process per node (three idle cores per node). Spreading the processes out, as in the latter two cases, often yields faster interprocess communication and better memory performance because fewer processes on each node compete for the same off-node bandwidth. However, because this approach prevents the additional nodes from being used by other jobs and lowers overall system throughput, its use is discouraged on NCCS systems, and jobs are charged for the entire node even when some cores are idle.)
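
The process-placement note above reduces to simple arithmetic. The following sketch (not from the paper; the function name and variable names are illustrative) enumerates the possible layouts of a fixed MPI process count on nodes with a given number of cores, reproducing the 128-process, four-core-per-node example and the whole-node charging policy described above.

    # A minimal sketch, assuming only the numbers quoted in the note above.
    def layouts(n_procs, cores_per_node):
        """Yield (nodes, procs_per_node, idle_cores_per_node) for layouts that divide evenly."""
        for ppn in range(1, cores_per_node + 1):
            if n_procs % ppn == 0:
                yield n_procs // ppn, ppn, cores_per_node - ppn

    if __name__ == "__main__":
        cores = 4  # cores per node, as in the 128-process example
        for nodes, ppn, idle in layouts(128, cores):
            # Jobs are charged for whole nodes, so idle cores still count toward usage.
            print(f"{nodes:3d} nodes x {ppn} procs/node, {idle} idle cores/node, "
                  f"{nodes * cores} cores charged")

Running the sketch prints the three layouts from the note (128 x 1, 64 x 2, and 32 x 4 processes per node) and shows that the spread-out layouts are charged for 512 and 256 cores, respectively, versus 128 for fully populated nodes.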