Efficient 3D stencil computations using CUDA

Authors:
Marcin Krotkiewski;Marcin Dabrowski
Affiliations:
-;-
Venue:
Parallel Computing
Year:
2013

Citing 14
Cited 0

Linear algebra operators for GPU implementation of numerical algorithms

ACM SIGGRAPH 2003 Papers
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Benchmarking GPUs to tune dense linear algebra

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
3D finite difference computation on GPUs using CUDA

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Implementing sparse matrix-vector multiplication on throughput-oriented processors

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Improving Performance of Matrix Multiplication and FFT on GPU

ICPADS '09 Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems
Parallel symmetric sparse matrix-vector product on scalar multi-core CPUs

Parallel Computing
Towards dense linear algebra for hybrid GPU accelerated manycore systems

Parallel Computing
On the limits of GPU acceleration

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
The pochoir stencil compiler

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures

IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

Proceedings of the Tenth International Symposium on Code Generation and Optimization
High-performance code generation for stencil computations on GPU architectures

Proceedings of the 26th ACM international conference on Supercomputing

Quantified Score

Hi-index	0.00

Visualization

Abstract

We present an efficient implementation of 7-point and 27-point stencils on high-end Nvidia GPUs. A new method of reading the data from the global memory to the shared memory of thread blocks is developed. The method avoids conditional statements and requires only two coalesced instructions to load the tile data with the halo (ghost zone). Additional optimizations include storing only one XY tile of data at a time in the shared memory to lower shared memory requirements, common subexpression elimination to reduce the number of instructions, and software prefetching to overlap arithmetic and memory instructions, and enhance latency hiding. The efficiency of our implementation is analyzed using a simple stencil memory footprint model that takes into account the actual halo overhead due to the minimum memory transaction size on the GPUs. Through experiments we demonstrate that in our implementation the memory overhead due to the halos is largely eliminated by good reuse of the halo data in the memory caches, and that our method of reading the data is close to optimal in terms of memory bandwidth usage. Detailed performance analysis for single precision stencil computations, and performance results for single and double precision arithmetic on two Tesla cards are presented. Our stencil implementations are more efficient than any other implementation described in the literature to date. On Tesla C2050 with single and double precision arithmetic our 7-point stencil achieves an average throughput of 12.3 and 6.5Gpts/s, respectively (98 GFLOP/s and 52 GFLOP/s, respectively). The symmetric 27-point stencil sustains a throughput of 10.9 and 5.8 Gpts/s, respectively.