We present a scalable parallelization scheme for high-order stencil computations that also optimizes memory behavior on multicore clusters. Our multilevel approach combines: (i) inter-node parallelization via spatial decomposition; (ii) inter-core parallelization via multithreading and explicit non-uniform memory access (NUMA) control; (iii) data locality optimizations through auto-tuned tiling for efficient use of hierarchical memory; and (iv) register blocking and data parallelism via single-instruction multiple-data (SIMD) techniques to utilize registers and exploit data locality. The scheme is applied to a sixth-order stencil-based finite-difference time-domain (FDTD) code. Weak-scaling parallel efficiency is over 98% on 32,768 BlueGene/P processors. Multithreading with explicit NUMA control attains a 9.9-fold speedup on a dual 12-core AMD Opteron system. Data locality optimizations achieve a 7.7-fold reduction in the last-level cache miss rate on an Intel Nehalem processor, while register blocking increases data parallelism and thereby achieves 5.9 Gflops on a single core. Combining register blocking with multithreading yields a 5.8-fold speedup on a single quad-core Nehalem.
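
The abstract names the intra-node optimization levels but gives no code, so here is a minimal illustrative sketch (not the authors' implementation) of how threading with first-touch NUMA placement, cache tiling in y/z, and a unit-stride vectorizable inner loop might look for one axis of a radius-3 (sixth-order) stencil. The grid sizes NX/NY/NZ, tile sizes TY/TZ, and the helper names IDX, numa_first_touch, and stencil_x are all hypothetical; the coefficients are the standard sixth-order central-difference second-derivative weights, not values taken from the paper.

/* Illustrative sketch only -- not the paper's code.
 * Compile with OpenMP enabled, e.g. gcc -O3 -fopenmp. */
#include <stddef.h>
#include <stdlib.h>

enum { NX = 256, NY = 256, NZ = 256,  /* hypothetical grid size */
       TY = 16, TZ = 64,             /* hypothetical tile sizes */
       R = 3 };                      /* radius-3 = sixth-order stencil */

#define IDX(i, j, k) (((size_t)(i) * NY + (j)) * NZ + (k))

/* First-touch initialization: each thread touches the pages it will
 * later compute on, so they are placed in its local NUMA domain. */
static void numa_first_touch(double *a)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            for (int k = 0; k < NZ; k++)
                a[IDX(i, j, k)] = 0.0;
}

/* x-direction derivative with cache tiling in y/z; the innermost
 * k loop is unit-stride, so the compiler can apply SIMD to it. */
static void stencil_x(const double *restrict in, double *restrict out,
                      const double c[R + 1])
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int jj = 0; jj < NY; jj += TY)          /* tile in y */
    for (int kk = 0; kk < NZ; kk += TZ)          /* tile in z */
        for (int i = R; i < NX - R; i++)
            for (int j = jj; j < jj + TY; j++)
                for (int k = kk; k < kk + TZ; k++) {
                    double s = c[0] * in[IDX(i, j, k)];
                    for (int r = 1; r <= R; r++)
                        s += c[r] * (in[IDX(i - r, j, k)]
                                   + in[IDX(i + r, j, k)]);
                    out[IDX(i, j, k)] = s;
                }
}

int main(void)
{
    double *in  = malloc((size_t)NX * NY * NZ * sizeof *in);
    double *out = malloc((size_t)NX * NY * NZ * sizeof *out);
    if (!in || !out) return 1;
    /* standard 6th-order second-derivative weights (per h^2) */
    const double c[R + 1] = { -49.0 / 18, 3.0 / 2, -3.0 / 20, 1.0 / 90 };
    numa_first_touch(in);
    numa_first_touch(out);
    stencil_x(in, out, c);
    free(in);
    free(out);
    return 0;
}

The full scheme described in the abstract would add MPI spatial decomposition with halo exchange across nodes and auto-tune TY/TZ per cache hierarchy; those parts are omitted here.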