This paper provides a performance evaluation of the astrophysics code FLASH on a variety of Intel multiprocessors. The work was performed at the NASA Center for Computational Sciences (NCCS) on behalf of the Carnegie Institution of Washington (CIW), both as a preliminary study for the acquisition of a high-performance computing (HPC) system at the CIW and so that the NCCS could measure the relative performance of a recently acquired Intel Nehalem-based system against previously installed multicore HPC resources. A brief overview of computer performance evaluation is provided, followed by descriptions of the systems under test and the FLASH test problem, and then the test results. The paper also characterizes some of the effects of the load imbalance imposed by adaptive mesh refinement. Copyright © 2012 John Wiley & Sons, Ltd.

(The same number of MPI processes can be distributed across nodes in various ways. For instance, a 128-process job on a system with four cores per node may be run on 32 nodes with four processes per node (fully populated nodes), on 64 nodes with two processes per node (two idle cores per node), or on 128 nodes with one process per node (three idle cores per node). Spreading the processes out, as in the latter two cases, often yields faster interprocess communication and better memory performance because fewer processes per node share the same off-node bandwidth. However, because this approach prevents the extra nodes from being used by other jobs and lowers overall system throughput, it is discouraged on NCCS systems, and jobs are charged for the entire node even when some of its cores are idle.)
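As an illustrative aside (not from the paper), the short MPI program below has each rank report the node it is running on, which is one way to verify which of the process-per-node layouts described above a job actually received; the output format is an assumption for illustration.

/* Minimal sketch: each MPI rank reports its host node so the
 * process-per-node placement of a job can be checked at run time. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(node, &len);

    /* e.g., "rank 5 of 128 on node042" */
    printf("rank %d of %d on %s\n", rank, size, node);

    MPI_Finalize();
    return 0;
}

Counting the distinct node names in the output gives the number of nodes allocated, and the number of ranks sharing each name gives the population per node.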