High-order stencil computations on multicore clusters

  • Authors:
  • Liu Peng; Richard Seymour; Ken-ichi Nomura; Rajiv K. Kalia; Aiichiro Nakano; Priya Vashishta; Alexander Loddoch; Michael Netzband; William R. Volz; Chap C. Wong

  • Affiliations:
  • Collaboratory for Advanced Computing and Simulations, Department of Computer Science, Department of Physics & Astronomy, Department of Chemical Engineering & Material Science (Peng, Seymour, Nomura, Kalia, Nakano, Vashishta); Technical Computing, Chevron ETC, Houston, TX 77002, USA (Loddoch, Netzband, Volz, Wong)

  • Venue:
  • IPDPS '09: Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
  • Year:
  • 2009

Abstract

Stencil computation (SC) is of critical importance for broad scientific and engineering applications. However, optimizing complex, high-order SC on emerging clusters of multicore processors is a challenge. We have developed a hierarchical SC parallelization framework that combines: (1) spatial decomposition based on message passing; (2) multithreading using a critical-section-free dual representation; and (3) single-instruction multiple-data (SIMD) parallelism based on various code transformations. Our SIMD transformations include translocated statement fusion, vector composition via shuffle, and vectorized data-layout reordering (e.g., matrix transpose), which are combined with traditional optimization techniques such as loop unrolling. We have thereby implemented two SCs of different characteristics, a diagonally dominant lattice Boltzmann method (LBM) for fluid-flow simulation and a highly off-diagonal, sixth-order finite-difference time-domain (FDTD) code for seismic wave propagation, on a Cell Broadband Engine (Cell BE) based system (a cluster of PlayStation3 consoles), a dual Intel quad-core platform, and IBM BlueGene/L and P. We have achieved high inter-node and intra-node (multithreading and SIMD) scalability for the diagonally dominant LBM: weak-scaling parallel efficiency of 0.978 on 131,072 BlueGene/P processors; strong-scaling multithreading efficiency of 0.882 on 6 cores of Cell BE; and strong-scaling SIMD efficiency of 0.780 using the 4-element vector registers of Cell BE. Implementation of the high-order SC, in contrast, is less efficient due to long-stride memory access and the limited size of the vector register file, pointing to the need for further optimizations.