Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

  • Authors:
  • Tareq Malas; Aron J. Ahmadia; Jed Brown; John A. Gunnels; David E. Keyes

  • Affiliations:
  • King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; King Abdullah University of Science and Technology, Thuwal, Saudi Arabia; Argonne National Laboratory, Argonne, IL, USA; IBM T.J. Watson Research Center, Yorktown Heights, NY, USA; King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

  • Venue:
  • International Journal of High Performance Computing Applications
  • Year:
  • 2013

Abstract

Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer's PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU's instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7× speedup over the best previously published results.