Impact of Quad-Core Cray XT4 System and Software Stack on Scientific Computation

Authors:
S. R. Alam;R. F. Barrett;H. Jagode;J. A. Kuehn;S. W. Poole;R. Sankaran
Affiliations:
Oak Ridge National Laboratory, Oak Ridge, USA 37831;Oak Ridge National Laboratory, Oak Ridge, USA 37831;Oak Ridge National Laboratory, Oak Ridge, USA 37831;Oak Ridge National Laboratory, Oak Ridge, USA 37831;Oak Ridge National Laboratory, Oak Ridge, USA 37831;Oak Ridge National Laboratory, Oak Ridge, USA 37831
Venue:
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Year:
2009

Abstract

An upgrade from dual-core to quad-core AMD processors on the Cray XT system at the Oak Ridge National Laboratory (ORNL) Leadership Computing Facility (LCF) has resulted in significant changes in the hardware and software stack, including a deeper memory hierarchy, SIMD instructions, and a multi-core-aware MPI library. In this paper, we evaluate the impact of a subset of these key changes on large-scale scientific applications. We provide insights into the application tuning and optimization process and report on how different strategies yield varying degrees of success across application domains. For instance, we demonstrate that SSE vectorization instructions provide a performance boost of as much as 50% on fusion and combustion applications. Moreover, we show how resource contention can limit the achievable performance, and we provide insights into how applications could exploit the hierarchical parallelism of the petascale XT5 system.