3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

Authors:
Anthony Nguyen; Nadathur Satish; Jatin Chhugani; Changkyu Kim; Pradeep Dubey
Venue:
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2010

Abstract

Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows more slowly than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D spatial blocking and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with the SIMD width and the number of cores. Our performance is better than or comparable to that of state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of the 7-point stencil is 1.5X faster on CPUs and 1.8X faster on GPUs for single-precision floating-point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup on CPUs is 2.1X.
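To make the spatial half of the idea concrete, the following is a minimal sketch of 2.5D blocking for a 7-point stencil. It is an illustration under stated assumptions, not the authors' implementation: the grid is tiled in the XY plane and each tile is streamed through Z while only three small sub-planes are held in a local buffer, standing in for on-chip memory. The function names, the coefficients `ALPHA` and `BETA`, the tile sizes, and the fixed-boundary treatment are all hypothetical choices made for the example. Temporal blocking, the extra "D" in 3.5D, is deliberately left out.

```python
# Sketch of 2.5D spatial blocking for a 7-point stencil (pure Python).
# All names and coefficients below are illustrative assumptions.

ALPHA, BETA = 0.4, 0.1  # hypothetical stencil coefficients


def make_grid(nx, ny, nz):
    """Deterministic test grid, indexed as g[z][y][x]."""
    return [[[float((x * 7 + y * 3 + z * 5) % 11) for x in range(nx)]
             for y in range(ny)] for z in range(nz)]


def update(zm, zc, zp, y, x):
    """7-point stencil at (y, x) using the planes below, at, and above z."""
    return (ALPHA * zc[y][x]
            + BETA * (zc[y][x - 1] + zc[y][x + 1]
                      + zc[y - 1][x] + zc[y + 1][x]
                      + zm[y][x] + zp[y][x]))


def sweep_naive(src):
    """One time step over the whole grid; boundary cells keep their values."""
    nz, ny, nx = len(src), len(src[0]), len(src[0][0])
    dst = [[row[:] for row in plane] for plane in src]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                dst[z][y][x] = update(src[z - 1], src[z], src[z + 1], y, x)
    return dst


def load_subplane(plane, y0, y1, x0, x1):
    """Copy the tile [y0, y1) x [x0, x1) plus a one-cell halo on each side."""
    return [plane[y][x0 - 1:x1 + 1] for y in range(y0 - 1, y1 + 1)]


def sweep_blocked(src, tile_y, tile_x):
    """One time step: tile in XY, stream through Z with three sub-planes."""
    nz, ny, nx = len(src), len(src[0]), len(src[0][0])
    dst = [[row[:] for row in plane] for plane in src]
    for y0 in range(1, ny - 1, tile_y):
        y1 = min(y0 + tile_y, ny - 1)
        for x0 in range(1, nx - 1, tile_x):
            x1 = min(x0 + tile_x, nx - 1)
            # Local buffers: only three sub-planes are live at any time.
            below = load_subplane(src[0], y0, y1, x0, x1)
            centre = load_subplane(src[1], y0, y1, x0, x1)
            for z in range(1, nz - 1):
                above = load_subplane(src[z + 1], y0, y1, x0, x1)
                for y in range(y0, y1):
                    for x in range(x0, x1):
                        # Shift to buffer coordinates (halo occupies index 0).
                        dst[z][y][x] = update(below, centre, above,
                                              y - y0 + 1, x - x0 + 1)
                below, centre = centre, above  # rotate planes along Z
    return dst


if __name__ == "__main__":
    grid = make_grid(8, 7, 6)
    assert sweep_blocked(grid, 3, 4) == sweep_naive(grid)
    print("blocked sweep matches naive sweep")
```

Both sweeps perform the same arithmetic in the same order for each cell, so their results are identical. The difference is that the blocked version touches only a tile-sized working set per Z step, which is what lets each loaded plane be reused from fast memory. A 3.5D variant would additionally carry several time steps per tile before writing back.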