3D finite difference computation on GPUs using CUDA

Authors:
Paulius Micikevicius
Affiliations:
NVIDIA, Santa Clara, CA
Venue:
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Year:
2009

Citing 4
Cited 43

Implicit and explicit optimizations for stencil computations

Proceedings of the 2006 workshop on Memory system performance and correctness
Scalable Parallel Programming with CUDA

Queue - GPU Computing
NVIDIA Tesla: A Unified Graphics and Computing Architecture

IEEE Micro
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Proceedings of the 2008 ACM/IEEE conference on Supercomputing

State-of-the-art in heterogeneous computing

Scientific Programming
Large-scale FFT on GPU clusters

Proceedings of the 24th ACM International Conference on Supercomputing
High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster

Journal of Computational Physics
Asynchronous Communication Schemes for Finite Difference Methods on Multiple GPUs

CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Assessment of GPU computational enhancement to a 2D flood model

Environmental Modelling & Software
Data layout transformation for stencil computations on short-vector SIMD architectures

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
The pochoir stencil compiler

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Mint: realizing CUDA performance in 3D stencil methods with annotated C

Proceedings of the international conference on Supercomputing
CudaDMA: optimizing GPU memory bandwidth via warp specialization

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Hardware/software co-design for energy-efficient seismic modeling

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Efficient probabilistic and geometric anatomical mapping using particle mesh approximation on GPUs

Journal of Biomedical Imaging - Special issue on Parallel Computation in Medical Imaging Applications
Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

The Journal of Supercomputing
Cardiac simulation on multi-GPU platform

The Journal of Supercomputing
Automatic communication optimizations through memory reuse strategies

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Shallow water simulations on multiple GPUs

PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
Revisiting finite difference and spectral migration methods on diverse parallel architectures

Computers & Geosciences
Fast seismic modeling and reverse time migration on a graphics processing unit cluster

Concurrency and Computation: Practice & Experience
Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

Proceedings of the Tenth International Symposium on Code Generation and Optimization
High-performance code generation for stencil computations on GPU architectures

Proceedings of the 26th ACM international conference on Supercomputing
A Fourier integral algorithm and its GPU/CPU collaborative implementation for one-way wave equation migration

Computers & Geosciences
Tuning solution of large non-Hermitian linear systems on multiple graphics processing unit accelerated workstations

International Journal of High Performance Computing Applications
Forward and back substitution algorithms on GPU: a case study on modified incomplete Cholesky Preconditioner for three-dimensional finite difference method

The Journal of Supercomputing
Profile-guided floating- to fixed-point conversion for hybrid FPGA-processor applications

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Modeling irregular boundary with transmission-line modeling method

International Journal of RF and Microwave Computer-Aided Engineering
Multi-GPU implementation of the lattice Boltzmann method

Computers & Mathematics with Applications
Acceleration of stable TTI P-wave reverse-time migration with GPUs

Computers & Geosciences
GPU accelerated flow solver for direct numerical simulation of turbulent flows

Journal of Computational Physics
Direct numerical simulation of turbulence using GPU accelerated supercomputers

Journal of Computational Physics
Automating resource optimisation in reconfigurable design (abstract only)

Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Multi-level parallelism for incompressible flow computations on GPU clusters

Parallel Computing
A peta-scalable CPU-GPU algorithm for global atmospheric simulations

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Parallel electronic structure calculations using multiple graphics processing units (GPUs)

PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing
Steering and in-situ visualization for simulation of seismic wave propagation on graphics cards

PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing
Vectorized higher order finite difference kernels

PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing
Split tiling for GPUs: automatic parallelization using trapezoidal tiles

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Memory reuse optimizations in the R-Stream compiler

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
A stencil compiler for short-vector SIMD architectures

Proceedings of the 27th international ACM conference on International conference on supercomputing
3D seismic reverse time migration on GPGPU

Computers & Geosciences
Efficient 3D stencil computations using CUDA

Parallel Computing
Approximate inference for spatial functional data on massively parallel processors

Computational Statistics & Data Analysis
Accelerating Single Iteration Performance of CUDA-Based 3D Reaction-Diffusion Simulations

International Journal of Parallel Programming

Quantified Score

Hi-index	0.01

Abstract

In this paper we describe a GPU parallelization of the 3D finite difference computation using CUDA. Data access redundancy is used as the metric to determine the optimal implementation for both the stencil-only computation and the discretization of the wave equation, which is currently of great interest in seismic computing. For the larger stencils, the described approach achieves a throughput of between 2,400 and over 3,000 million output points per second on a single Tesla 10-series GPU. This is roughly an order of magnitude higher than a 4-core Harpertown CPU running similar code from the seismic industry. Multi-GPU parallelization is also described, achieving linear scaling with the number of GPUs by overlapping inter-GPU communication with computation.
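
To make the terms in the abstract concrete, here is a minimal CPU reference sketch in plain Python, not the paper's CUDA implementation. It assumes a symmetric, axis-aligned stencil of radius r and reads "data access redundancy" as elements read per output point for a 2D tile with four halo strips. The function names `stencil_3d` and `read_redundancy`, the tile layout, and the boundary handling are all illustrative assumptions; none of them come from the abstract.

```python
def stencil_3d(grid, coeffs):
    """Apply a symmetric axis-aligned stencil of radius r = len(coeffs) - 1.

    grid is a nested list indexed grid[z][y][x]. Points closer than r to
    any boundary are left at 0.0 (an assumption made for brevity).
    """
    r = len(coeffs) - 1
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for z in range(r, nz - r):
        for y in range(r, ny - r):
            for x in range(r, nx - r):
                # Centre point, then r symmetric neighbours along each axis.
                acc = coeffs[0] * grid[z][y][x]
                for i in range(1, r + 1):
                    acc += coeffs[i] * (
                        grid[z][y][x - i] + grid[z][y][x + i]
                        + grid[z][y - i][x] + grid[z][y + i][x]
                        + grid[z - i][y][x] + grid[z + i][y][x]
                    )
                out[z][y][x] = acc
    return out


def read_redundancy(tile_x, tile_y, radius):
    """Elements read per output point for one 2D tile plus four halo strips.

    Hypothetical model: the tile itself is read once, and each of the four
    sides needs a halo strip `radius` elements deep (corners are not read).
    """
    outputs = tile_x * tile_y
    reads = outputs + 2 * radius * (tile_x + tile_y)
    return reads / outputs
```

Under this model, larger tiles amortize the halo reads over more output points, so `read_redundancy(32, 32, 4)` is lower than `read_redundancy(16, 16, 4)`. This is the kind of trade-off such a metric is meant to expose.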