Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

Authors:
Tareq Malas;Aron J. Ahmadia;Jed Brown;John A. Gunnels;David E. Keyes
Affiliations:
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia;King Abdullah University of Science and Technology, Thuwal, Saudi Arabia;Argonne National Laboratory, Argonne, IL, USA;IBM T.J. Watson Research Center, Yorktown Heights, NY, USA;King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Venue:
International Journal of High Performance Computing Applications
Year:
2013



Abstract

Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the CPU. We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM® Blue Gene®/P supercomputer's PowerPC® 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU's instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7× speedup over the best previously published results.
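For readers unfamiliar with the term, a 27-point stencil updates each interior grid point from a weighted sum over its 3×3×3 neighborhood. The sketch below is a plain reference definition in Python, not the synthesized PowerPC 450 assembly kernel the abstract describes. The function name `stencil27`, the nested-list grid layout, and the choice to leave boundary points at zero are assumptions made only for illustration.

```python
def stencil27(u, w):
    """Reference 27-point stencil (illustrative, unoptimized).

    u: 3D grid as nested lists indexed u[k][j][i] (assumed layout).
    w: 3x3x3 weights indexed w[dk + 1][dj + 1][di + 1].
    Returns a new grid; boundary points are left at 0.0 (an assumption).
    """
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                acc = 0.0
                # Accumulate the weighted 3x3x3 neighborhood of (k, j, i).
                for dk in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        for di in (-1, 0, 1):
                            acc += w[dk + 1][dj + 1][di + 1] * u[k + dk][j + dj][i + di]
                out[k][j][i] = acc
    return out
```

Each output point reads 27 inputs and performs 27 multiply-adds with a fully regular access pattern, which is why such kernels are called streaming and why their sustained performance depends on instruction scheduling and register reuse.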